MATERIALS AND METHODS RELATING TO CANCER DIAGNOSIS



The present invention concerns materials and methods for 
diagnosing cancer, especially breast cancer. Particularly, 
but not exclusively, the invention relates to methods and 
kits for diagnosing the presence or risk of breast cancer 
using genetic identifiers. 

Carcinoma of the breast is one of the leading causes of 
death and major illness amongst female populations 
worldwide. Despite rapid advances in understanding the 
molecular and genetic events that underlie breast 
carcinogenesis and the introduction of clinical screening 
programs, morbidity and mortality due to this disease
unfortunately still remain at an unacceptably high level.
Indeed, for many parts of the world, breast cancer remains 
one of the fastest growing cancers in local female 
populations (Chia et al., 2000). One major challenge in the
diagnosis and treatment of breast cancer is its clinical 
and molecular heterogeneity. Individual breast cancers can 
exhibit tremendous variations in clinical presentation, 
disease aggressiveness, and treatment response (Tavassoli 
and Schitt, 1992), suggesting that this clinical entity may 
actually represent a conglomerate of many different and 
distinct cancer subtypes. In addition to variations in 
clinical behaviour, breast cancer can also display 
strikingly distinct patterns of incidence in different 
regional and ethnic populations. For example, in Caucasian 
populations, the majority of breast cancers occur in post-
menopausal women, at a mean and median age of 60 and 61
respectively (Giuliano, 1998). In contrast, studies in
Asian populations show a bi-modal age-of-incidence pattern
beginning at age 40 (Chia et al., 2000, see discussion).
Thus, one outstanding question in tumour biology is to
explain these regional and ethnic differences on the basis
of genetic or environmental factors, and to ascertain whether
research findings obtained using Caucasian populations can
be clinically translated to other ethnic populations as
well.

Expression profiling using DNA microarrays has recently
proved to be an extremely powerful and versatile approach
towards the investigation of multiple aspects of tumour
biology. Previous reports using microarrays on breast
cancers have focused on the identification of novel tumour
subtypes, or on the identification of genes that are
differentially expressed between known cancer subgroups
(Perou et al., 2000; Gruvberger et al., 2001; Hedenfalk et
al., 2001). However, because these studies have focused
primarily on samples obtained from Caucasian
populations, it remains an open question whether the findings
described in these reports will also apply to breast
cancers from other ethnic populations. Many
other key issues also need to be addressed before the use
of molecular profiling can become a clinical reality. For
instance, there are at present almost no published reports
where the expression signatures and molecular subtypes
defined in one institution's study have been independently
confirmed in a separate series from another centre. Such
validations are obviously essential, however, as different
health-care institutions are likely to differ in multiple
ways which may affect the expression profile of the tumor
being studied, such as in the surgical handling of tumor
samples, choice of array technology platform, and patient
population base. In addition, because it is usually
unfeasible to sample the same tumor over an extended period
of time, it is often unclear if the different subtypes
defined using these approaches truly represent distinct
biological entities, or if they represent a single tumor
class in different stages of clinical evolution. As one
example, there are currently conflicting opinions and data
in the field on whether estrogen receptor negative (ER-)
breast cancers represent biological entities that have
directly arisen from an ER- progenitor cell type in the
breast epithelia, or if they have 'evolved' from an
originally ER+ state (Kuukasjarri et al., 1996; Pari 2000;
Gruvberger et al., 2001).

To address these issues, the inventors have embarked upon a 
large-scale expression profiling project of breast tumours 
derived from Asian patients. First, using a combination of 
supervised and unsupervised clustering methods, they have 
been able to define a small set of genes which when used in 
combination serves as a 'genetic identifier' to distinguish 
if an unknown breast sample is either normal or malignant 
in a patient of ethnic Chinese descent. Such
'genetic identifiers' are of considerable use in the
development of molecular diagnostic assays for specific 
patient populations. Second, using principal component 
analysis (PCA), the inventors show that the expression
profiles of normal breast tissues are considerably less
varied than tumour profiles. This finding supports current
models of breast tumourigenesis, in which to a first
approximation normal breast tissues can be thought of as a
relatively constant 'ground state', and the widely
varying expression profiles associated with individual
tumours are probably indicative of their arising from this
'ground state' through many different and highly distinct
tumourigenic pathways.

Third, by comparing the expression profiles of a series of 
invasive breast cancers from Chinese patients to published 
reports using patient samples of primarily Caucasian 
origin, they found that despite several inter-study 
methodological differences including choice of array 
technology platform, many of the key gene signatures and 
molecular subtypes were remarkably conserved between the 
two patient populations, suggesting that the molecular 
subtypes defined using expression-based genomics are indeed 
highly robust. To the inventors' knowledge, this is the 
first cross-institution validation study of this type 
reported for breast cancer. 

Fourth, by studying the expression profiles of a series of 
ductal in-situ cancers (ductal carcinoma in situ, or DCIS),
they also found that DCIS tumors express many of the 
'hallmark' subtype-specific expression signatures 
associated with their invasive counterparts. Since DCIS 
cancers currently represent the earliest non-invasive 
malignant lesion detectable by conventional histopathology, 
these results suggest that the molecular subtypes defined 
in these studies probably arise at a relatively early stage 
of tumorigenesis (i.e. pre-invasive) and represent distinct
biological entities, rather than a single cancer class in
different stages of evolution.

Besides providing a molecular framework for the temporal
progression of breast cancer, the inventors' results also
support the feasibility of using expression-based genomic
technologies for clinical cancer diagnosis and
classification across different health-care institutions.

Thus, at its most general, the present invention provides a 
new diagnostic assay for determining the presence or risk 
of cancer, particularly breast cancer, in a patient using 
specific genetic identifiers. Further, the inventors have 
determined a series of multi-gene classifiers for breast 
cancer. 

In the first instance, the inventors have determined a set 
of 20 genes (a "genetic identifier") which may be used in 
combination to predict if an unknown breast tissue sample 
is either normal or malignant. 

In addition to this first geneset (which can distinguish 
between tumor and normal breast samples), the inventors 
have also determined other genesets which can be used as
genetic identifiers to classify tumour samples as to 
subtype. This is of great importance, not only from a 
research standpoint, but also to ensure the most 
appropriate treatment is provided. 

Thus, the inventors have determined the following genesets 
which may be used to predict the presence of breast tumour 
and/or the class of tumour. 

1) The geneset provided in Table 2, which when used as a
combination, allows a user to predict if an unknown
breast tissue sample is either normal or malignant,
particularly using spotted cDNA microarrays.

2) A further set of genes (Table 4a and 4b) which when used 
in combination can also be used to distinguish between 
normal and tumour breast tissue samples. This geneset 
is more preferably used on expression profiles obtained 
using a commercially available technology platform such 
as genechips, e.g. Affymetrix U133A Genechips, but can 
also be utilized employing the spotted cDNA microarray 
technology described in 1).

3) A set of genes (Table 5a) which when used in combination 
can predict the Estrogen Receptor status of a confirmed 
breast tumour sample. A second set of genes (Table 5b) 
which when used in combination can predict the ERBB2 
status of a confirmed breast tumour sample. 

4) A set of genes (Table 6) which when used in combination 
can be used to predict the "molecular subtype" of a 
breast tumour sample according to the following 5 
categories: Luminal, Basal, ERBB2, Normal-like, and ER-
negative subtype II. In this embodiment of the present
invention, the inventors have used two different types
of classification algorithms, namely, (1) one-vs-all
(OVA) support vector machines (SVM); and (2) genetic
algorithm (GA)/maximum likelihood discriminant (MLHD)
analysis. Different sets of genes are optimally used
depending upon the type of classification algorithm
used. Thus, distinct sets of genes are described below
for each part.

5) A set of genes (Table 7) which when used in combination
can be used to predict luminal subclass in Asian breast 
cancer patients. The inventors have determined that 
breast tumours of the "luminal" variety can be "split" 
into two distinct subtypes, Luminal A and Luminal D, which
are clinically relevant. The genetic identifier (Table
7) is therefore preferably used after the tumour has
been formally recognised as "luminal" in nature. This,
of course, can be achieved using the multi-class
predictor of Table 6. The Luminal D tumours are
associated with certain expression signatures that are
also found in highly aggressive non-Luminal tumours, e.g.
ERBB2 and Basal. This supports the clinical importance
of knowing the tumour subtype.

The determination of specific genesets (genetic 
identifiers) allows tissue samples to be classified (e.g. 
tumour v normal) according to the expression pattern of 
those genes in the tissue. For example, in the first 
genetic identifier (tumor vs normal) the inventors have
determined 10 genes that are usually up-regulated in tumour
cells relative to normal cells and 10 genes that are
usually down-regulated in tumour cells relative to normal
cells. By studying the expression pattern of these
particular genetic identifiers, i.e. the composite levels
of expression products of these genes in a test sample, it
is possible to classify the sample as malignant or normal.
Thus, the expression products are able to provide an
expression profile or "fingerprint" that can serve to
distinguish between normal and malignant cells.

In a first aspect of the present invention, there is
provided a method of creating a nucleic acid expression 
profile for a breast tumour cell comprising the steps of 

(a) isolating expression products from said breast 
tumour cell and a normal breast cell; 

(b) identifying the expression profile of a plurality 
of genes selected from Table 2, for both the tumour and
normal cell; 

(c) comparing the expression profile of the tumour 
cell and the normal cell; and 

(d) determining a nucleic acid expression profile 
characteristic of a breast tumour cell. 

For the purposes of diagnosis, it is important to obtain an 
expression profile that is characteristic of a tumour cell, 
i.e. distinct from the expression profile of the equivalent 
normal cell. The method according to the first aspect 
determines the expression profile of a plurality of genes 
identified by the inventors to be a "genetic identifier" of 
breast tumour cells (see Table 2).

The expression profile of the individual genes that 
comprise the genetic identifier will differ slightly 
between independent samples. However, the inventors have 
realised that the expression profile of these particular 
genes that comprise the genetic identifier when used in 
combination provides a characteristic pattern of expression
(expression profile) in a tumour cell that is recognisably 
different from that in a normal cell. 

By creating a number of expression profiles of the genetic 
identifier from a number of known tumour or normal samples, 
it is possible to create a library of profiles for both
normal and tumour samples. The greater the number of 
expression profiles, the easier it is to create a reliable 
characteristic expression profile standard (i.e. including 
statistical variation) that can be used as a control in a 
diagnostic assay. Thus, a standard profile may be one that 
is devised from a plurality of individual expression 
profiles and devised within statistical variation to 
represent either the tumour or normal cell profile. 
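
By way of illustration only, the derivation of such a standard from a library of individual profiles can be sketched as follows. The sketch assumes that each profile is a mapping from a gene label to an expression level, and that the per-gene mean and standard deviation are an acceptable summary of the statistical variation; the gene labels and values are hypothetical placeholders and are not taken from Table 2.

```python
from statistics import mean, stdev


def build_standard_profile(profiles):
    """Return {gene: (mean, standard deviation)} across a library of profiles.

    `profiles` is a list of dicts mapping a gene label to an expression level.
    Every profile is assumed to cover the same set of genes.
    """
    if len(profiles) < 2:
        raise ValueError("at least two profiles are needed to estimate variation")
    genes = list(profiles[0])
    return {
        gene: (mean(p[gene] for p in profiles), stdev(p[gene] for p in profiles))
        for gene in genes
    }


# Hypothetical expression levels for two placeholder genes in three tumour samples.
tumour_library = [
    {"GENE_A": 2.0, "GENE_B": -1.0},
    {"GENE_A": 3.0, "GENE_B": -2.0},
    {"GENE_A": 4.0, "GENE_B": -3.0},
]
tumour_standard = build_standard_profile(tumour_library)
```

The same function could be applied to a library of normal profiles to give a comparable normal standard.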

Thus, the method according to the first aspect of the 
invention comprises the steps of 

(a) isolating expression products from a breast 
tumour cell; contacting said expression products with a 
plurality of binding members capable of specifically and 
independently binding to expression products of a plurality 
of genes selected from Table 2, so as to create a first 
expression profile of a tumour cell; 

(b) isolating expression products from a normal 
breast cell; contacting said expression products with the 
plurality of binding members used in step (a), so as to 
create a comparable second expression profile of a normal 
breast cell; 

(c) comparing the first and second expression 
profiles to determine an expression profile characteristic 
of a breast tumour cell. 

The expression products are preferably mRNA, or cDNA made 
from said mRNA. Alternatively, the expression product 
could be an expressed polypeptide. Identification of the 
expression profile is preferably carried out using binding 
members capable of specifically identifying the expression 
products of genes identified in Table 2. For example, if
the expression products are cDNA then the binding members 
will be nucleic acid probes capable of specifically 
hybridising to the cDNA. 

Preferably, either the expression product or the binding 
member will be labelled so that binding of the two 
components can be detected. The label is preferably chosen 
so as to be able to detect the relative levels/quantity 
and/or absolute levels/quantity of the expressed product so 
as to determine the expression profile based on the up- 
regulation or down-regulation of the individual genes that 
comprise the genetic identifiers. In other words, it is 
preferable that the binding members are capable of not only 
detecting the presence of an expression product but its 
relative abundance (i.e. the amount of product available). 

The determination of the nucleic acid expression profile 
may be computerised and may be carried out within certain 
previously set parameters, to avoid false positives and 
false negatives. 

The computer may then be able to provide an expression 
profile standard characteristic of a normal breast cell and 
a malignant breast cell as discussed above. The determined 
expression profiles may then be used to classify breast 
tissue samples as normal or malignant as a way of 
diagnosis.

Thus, in a second aspect of the invention, there is 
provided an expression profile database comprising a 
plurality of gene expression profiles of both normal and 
malignant breast cells where the genes are selected from
Table 2, retrievably held on a data carrier. Preferably,
the expression profiles making up the database are produced 
by the method according to the first aspect. 

With the knowledge of the particular genetic identifiers, 
it is possible to devise many methods for determining the 
expression pattern or profile of the genes in a particular
test sample of cells. For example, the expressed nucleic
acid (RNA, mRNA) can be isolated from the cells using
standard molecular biological techniques. The expressed
nucleic acid sequences corresponding to the gene members of
the genetic identifiers given in Table 2 can then be
amplified using nucleic acid primers specific for the
expressed sequences in a PCR. If the isolated expressed
nucleic acid is mRNA, this can be converted into cDNA for
the PCR reaction using standard methods.

The primers may conveniently introduce a label into the
amplified nucleic acid so that it may be identified.
Ideally, the label is able to indicate the relative
quantity or proportion of nucleic acid sequences present
after the amplification event, reflecting the relative
quantity or proportion present in the original test sample.
For example, if the label is fluorescent or radioactive,
the intensity of the signal will indicate the relative
quantity/proportion, or even the absolute quantity, of the
expressed sequences. The relative quantities or proportions
of the expression products of each of the genetic
identifiers will establish a particular expression profile
for the test sample. By comparing this profile with known
profiles or standard expression profiles, it is possible to
determine whether the test sample was from normal breast
tissue or malignant breast tissue. 
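
As a purely illustrative sketch, the conversion of measured label intensities into relative proportions might be carried out as below. It assumes that one background-corrected intensity is available per gene and that dividing by the total signal is a suitable normalisation; the gene labels and intensities are hypothetical.

```python
def relative_proportions(intensities):
    """Scale raw label intensities so that they sum to 1 across the measured genes."""
    total = sum(intensities.values())
    if total <= 0:
        raise ValueError("total signal must be positive")
    return {gene: value / total for gene, value in intensities.items()}


# Hypothetical background-corrected intensities for three placeholder genes.
raw_signal = {"GENE_A": 600.0, "GENE_B": 300.0, "GENE_C": 100.0}
test_profile = relative_proportions(raw_signal)
```

The resulting proportions form an expression profile that can be compared with known or standard profiles expressed on the same scale.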

Alternatively, the expression pattern or profile can be 
determined using binding members capable of binding to the 
expression products of the genetic identifiers, e.g. mRNA, 
corresponding cDNA or expressed polypeptide. By labelling 
either the expression product or the binding member it is 
possible to identify the relative quantities or proportions 
of the expression products and determine the expression 
profile of the genetic identifiers. In this way the sample 
can be classified as normal or malignant by comparison of 
the expression profile with known profiles or standards. 
The binding members may be complementary nucleic acid 
sequences or specific antibodies. Microarray assays using 
such binding members are discussed in more detail below. 

In a third aspect of the present invention, there is provided
a method for determining the presence or risk of breast 
cancer in a patient comprising the steps of 

(a) obtaining expression products from breast tissue 
cells obtained from a patient suspected of having or at risk 
of having breast cancer; 

(b) contacting said expression products with one or more 
binding members capable of detecting the presence of an 
expression product corresponding to one or more genes 
identified in Table 2; and 

(c) determining the presence or risk of breast cancer in 
said patient based on the binding profile of the expression 
products from the breast tissue cells to the one or more 
binding members.

The patient is preferably a woman of Asian descent, e.g.
ethnic Chinese descent. 

The step of determining the presence or risk of breast cancer 
may be carried out by a computer which is able to compare the 
binding profile of the expression products from the breast 
tissue cells under test with a database of other previously 
obtained profiles and/or a previously determined "standard" 
profile which is characteristic of the presence or risk of 
the tumour. The computer may be programmed to report the 
statistical similarity between the profile under test and the 
standard profiles so that a diagnosis may be made. 
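
One possible way of reporting such a statistical similarity, given here only as a sketch, is to compute the Pearson correlation between the profile under test and each standard profile. The choice of Pearson correlation is an assumption made for illustration and is not stated above to be the measure used; the gene labels and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (spread_x * spread_y)


def report_similarity(test_profile, standards):
    """Return ({label: correlation}, best-matching label) for a test profile."""
    genes = sorted(test_profile)
    scores = {
        label: pearson([test_profile[g] for g in genes], [std[g] for g in genes])
        for label, std in standards.items()
    }
    best = max(scores, key=scores.get)
    return scores, best


# Hypothetical standard profiles and test profile (log ratios for placeholder genes).
standards = {
    "tumour": {"GENE_A": 2.0, "GENE_B": -2.0, "GENE_C": 0.0},
    "normal": {"GENE_A": -1.0, "GENE_B": 1.0, "GENE_C": 0.0},
}
test_profile = {"GENE_A": 1.5, "GENE_B": -1.0, "GENE_C": 0.2}
scores, best_match = report_similarity(test_profile, standards)
```

The reported scores would then be reviewed alongside other clinical information before any diagnosis is made.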

As mentioned above, the present inventors have identified 
several key genes which have a different expression pattern 
in tumour cells as opposed to normal cells of the breast. 
Collectively, these genes comprise a 'genetic identifier'.
The inventors have shown (see below) that the combinatorial 
expression pattern of the genes belonging to the "genetic 
identifier" serves to distinguish between normal and tumour 
cells. Thus, by detecting the expression pattern of the 
genetic identifier in a breast tissue sample, it is possible 
to predict the state of the cell (normal or malignant) and 
whether that patient has or is at risk of developing breast 
cancer. 

The genes that comprise the genetic identifier are given in 
Table 2. There are 20 genes shown, 10 of which are commonly 
highly expressed in tumour cells relative to normal cells and 
10 of which commonly have decreased expression in tumour 
cells relative to normal cells. The differential expression 
of the genes was determined using tumour biopsies and normal 
tissue biopsies. By detecting the levels of expression
products of these genes in a test sample, it is possible to 
classify the cells as normal or malignant based on the 
expression profile produced, i.e. an increase or decrease in 
their expression, relative to a standard pattern or profile 
seen in normal cells. 
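
The classification described above can be sketched, under stated assumptions, as a simple signed score: the mean log ratio (test versus normal standard) of the genes expected to be up-regulated, minus the mean log ratio of the genes expected to be down-regulated. The scoring rule, the threshold and the gene labels are all hypothetical and are given for illustration only.

```python
def tumour_score(log_ratios, up_genes, down_genes):
    """Mean log ratio of the 'up' genes minus mean log ratio of the 'down' genes.

    `log_ratios` maps a gene label to log(test / normal standard). A large
    positive score indicates a tumour-like pattern under this assumed rule.
    """
    up_mean = sum(log_ratios[g] for g in up_genes) / len(up_genes)
    down_mean = sum(log_ratios[g] for g in down_genes) / len(down_genes)
    return up_mean - down_mean


def classify(log_ratios, up_genes, down_genes, threshold=1.0):
    """Label a sample 'malignant' if its score exceeds a hypothetical threshold."""
    score = tumour_score(log_ratios, up_genes, down_genes)
    return "malignant" if score > threshold else "normal"


# Placeholder gene labels standing in for up- and down-regulated identifier genes.
up_genes = ["UP_1", "UP_2"]
down_genes = ["DOWN_1", "DOWN_2"]

tumour_like = {"UP_1": 1.5, "UP_2": 2.5, "DOWN_1": -1.0, "DOWN_2": -2.0}
normal_like = {"UP_1": 0.1, "UP_2": -0.1, "DOWN_1": 0.2, "DOWN_2": -0.2}

tumour_call = classify(tumour_like, up_genes, down_genes)
normal_call = classify(normal_like, up_genes, down_genes)
```

In practice any threshold would need to be set from profiles of known tumour and normal samples.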

Thus, in a further aspect of the invention, there is provided 
a method of classifying a sample of breast tissue as normal 
or malignant, said method comprising the steps of 

a) obtaining expression products from the cells of the 
breast tissue sample; 

b) contacting said expression products with a 
plurality of binding members capable of specifically binding 
to the expression products of a plurality of genes selected 
from Table 2; and 

c) classifying the sample as normal or malignant based 
on the binding profile of the expression products from the 
sample and the binding members. 

The sample of breast tissue is preferably from a woman of 
Asian descent, e.g. ethnic Chinese descent. 

As before, the expression product may be a transcribed 
nucleic acid sequence or the expressed polypeptide. The 
transcribed nucleic acid sequence may be RNA or mRNA. The 
expression product may also be cDNA produced from said mRNA. 

The binding member may be a complementary nucleic acid sequence
which is capable of specifically binding to the transcribed 
nucleic acid under suitable hybridisation conditions. 
Typically, cDNA or oligonucleotide sequences are used. 
Where the expression product is the expressed protein, the
binding member is preferably an antibody, or molecule 
comprising an antibody binding domain, specific for said 
expressed polypeptide. 

The binding member may be labelled for detection purposes 
using standard procedures known in the art. Alternatively, 
the expression products may be labelled following isolation 
from the sample under test. A preferred means of detection 
is using a fluorescent label which can be detected by a light 
meter. Alternative means of detection include electrical 
signalling. For example, the Motorola e-sensor system has 
two probes, a "capture probe" which is freely floating, and a 
"signalling probe" which is attached to a solid surface which 
doubles as an electrode surface. Both probes function as 
binding members to the expression product. When binding 
occurs, both probes are brought into close proximity with 
each other resulting in the creation of an electrical signal 
which can be detected. 

As discussed above, the binding members may be 
oligonucleotide primers for use in a PCR (e.g. multi-plexed 
PCR) to specifically amplify the number of expressed products 
of the genetic identifiers. The products would then be 
analysed on a gel. However, preferably, the binding member is a
single nucleic acid probe or antibody fixed to a solid
support. The expression products may then be passed over the
solid support, thereby bringing them into contact with the 
binding member. The solid support may be a glass surface, 
e.g. a microscope slide; beads (Lynx); or fibre-optics. In 
the case of beads, each binding member may be fixed to an 
individual bead and they are then contacted with the
expression products in solution. 

Various methods exist in the art for determining expression 
profiles for particular gene sets and these can be applied to 
the present invention. For example, bead-based approaches 
(Lynx) or molecular bar-codes (Surromed) are known 
techniques. In these cases, each binding member is attached 
to a bead or "bar-code" that is individually readable and 
free-floating to ease contact with the expression products. 
The binding of the binding members to the expression products 
(targets) is achieved in solution, after which the tagged 
beads or bar-codes are passed through a device (e.g. a flow- 
cytometer) and read. 

A further known method of determining expression profiles is 
instrumentation developed by Illumina, namely, fibre-optics. 
In this case, each binding member is attached to a specific 
"address" at the end of a fibre-optic cable. Binding of the 
expression product to the binding member may induce a 
fluorescent change which is readable by a device at the other 
end of the fibre-optic cable. 

The present inventors have successfully used a nucleic acid 
microarray comprising a plurality of nucleic acid sequences 
fixed to a solid support. By passing nucleic acid sequences 
representing expressed genes e.g. cDNA, over the microarray, 
they were able to create a binding profile characteristic of
the expression products from tumour cells and normal cells 
derived from breast tissue.

The present invention further provides a nucleic acid
microarray for classifying a breast tissue sample as
malignant or normal comprising a solid support housing a
plurality of nucleic acid sequences, said nucleic acid
sequences being capable of specifically binding to expression
products of one or more genes identified in Table 2. The
classification of the sample will lead to the diagnosis of
breast cancer in a patient. Preferably the solid support
will house nucleic acid sequences capable of
specifically and independently binding to expression products
of at least 5 genes, more preferably at least 10 genes or at
least 15 genes identified in Table 2. In a most preferred
embodiment, the solid support will house nucleic acid
sequences capable of specifically and independently
binding to expression products of all 20 genes identified in
Table 2.

Typically, high density nucleic acid sequences, usually cDNA
or oligonucleotides, are fixed onto very small, discrete
areas or spots of a solid support. The solid support is
often a microscope glass slide or a membrane filter, coated
with a substrate (or chips). The nucleic acid sequences are
delivered (or printed), usually by a robotic system, onto the
coated solid support and then immobilized or fixed to the
support.

In a preferred embodiment, the expression products derived
from the sample are labelled, typically using a fluorescent
label, and then contacted with the immobilized nucleic acid
sequences. Following hybridization, the fluorescent markers
are detected using a detector, such as a high resolution
laser scanner. In an alternative method, the expression
products could be tagged with a non-fluorescent label, e.g.
biotin. After hybridisation, the microarray could then be
'stained' with a fluorescent dye that binds/bonds to the
first non-fluorescent label (e.g. fluorescently labelled
streptavidin, which binds to biotin).

A binding profile indicating a pattern of gene expression 
(expression pattern or profile) is obtained by analysing the 
signal emitted from each discrete spot with digital imaging 
software. The pattern of gene expression of the experimental 
sample can then be compared with that of a control (i.e. an 
expression profile from a normal tissue sample) for 
differential analysis. 

As mentioned above, the control or standard, may be one or 
more expression profiles previously judged to be 
characteristic of normal or malignant cells. These one or 
more expression profiles may be retrievably stored on a data
carrier as part of a database. This is discussed above. 
However, it is also possible to introduce a control into the 
assay procedure. In other words, the test sample may be 
"spiked" with one or more "synthetic tumour" or "synthetic 
normal" expression products which can act as controls to be 
compared with the expression levels of the genetic 
identifiers in the test sample. 

Most microarrays utilize either one or two fluorophores. For
two-colour arrays, the most commonly used fluorophores are
Cy3 (green channel excitation) and Cy5 (red channel
excitation). The object of the microarray image analysis is
to extract hybridization signals from each expression
product. For one-colour arrays, signals are measured as
absolute intensities for a given target (essentially for
arrays hybridized to a single sample). For two-colour arrays,
signals are measured as ratios of two expression products
(e.g. sample and control; controls are otherwise known as a
'reference') with different fluorescent labels.
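
For a two-colour array, the ratio measurement can be sketched as a per-spot log2 ratio of the sample channel to the reference channel. The sketch below is illustrative only: it does not assume which dye labels the sample, and the intensity floor, gene labels and signal values are hypothetical.

```python
from math import log2


def log_ratios(sample_signal, reference_signal, floor=1.0):
    """Return log2(sample / reference) for each spot.

    Signals below `floor` are raised to it so that the ratio and its
    logarithm stay defined for very weak or empty spots.
    """
    return {
        spot: log2(max(sample_signal[spot], floor) / max(reference_signal[spot], floor))
        for spot in sample_signal
    }


# Hypothetical channel intensities for two placeholder spots.
sample = {"GENE_A": 4000.0, "GENE_B": 500.0}
reference = {"GENE_A": 1000.0, "GENE_B": 2000.0}
ratios = log_ratios(sample, reference)
```

A positive value indicates a stronger signal in the sample than in the reference, and a negative value a weaker one.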

The microarray in accordance with the present invention 
preferably comprises a plurality of discrete spots, each spot 
containing one or more oligonucleotides and each spot 
representing a different binding member for an expression 
product of a gene selected from Table 2. In a preferred
embodiment, the microarray will contain 20 spots for each of
the 20 genes provided in Table 2. Each spot will comprise a 
plurality of identical oligonucleotides each capable of 
binding to an expression product, e.g. mRNA or cDNA, of the 
gene of Table 2 it is representing. 

In a still further aspect of the present invention, there is 
provided a kit for classifying a breast tissue sample as 
normal or malignant, said kit comprising one or more binding 
members capable of specifically binding to an expression 
product of one or more genes identified in Table 2, and a 
detection means. 

Preferably, the one or more binding members (antibody binding 
domains or nucleic acid sequences e.g. oligonucleotides) in 
the kit are fixed to one or more solid supports e.g. a single 
support for microarray or fibre-optic assays, or multiple 
supports such as beads. The detection means is preferably a 
label (radioactive or dye, e.g. fluorescent) for labelling 
the expression products of the sample under test. The kit 
may also comprise means for detecting and analysing the
binding profile of the expression products under test. 

Alternatively, the binding members may be nucleotide primers 
capable of binding to the expression products of the genes 
identified in Table 2 such that they can be amplified in a 
PCR. The primers may further comprise detection means, i.e.
labels that can be used to identify the amplified sequences 
and their abundance relative to other amplified sequences. 

The kit may also comprise one or more standard expression 
profiles retrievably held on a data carrier for comparison 
with expression profiles of a test sample. The one or more 
standard expression profiles may be produced according to 
the first aspect of the present invention. 

The present invention further provides a method of 
diagnosing the presence or risk of breast cancer in a 
patient of Asian descent, said method comprising 

obtaining a breast tissue sample; 

isolating expression products from said sample; 

labelling said expression products; 

contacting said labelled expression products with a 
plurality of binding members representing a plurality of 
genes selected from Table 2; 

determining the presence or risk of breast cancer in 
said patient, based on the binding profile of said labelled 
expression products and the binding members. 

The breast tissue sample may be obtained as excisional 
breast biopsies or fine-needle aspirates. 
Again, the expression products are preferably mRNA or cDNA 
produced from said mRNA. The binding members are 
preferably oligonucleotides fixed to one or more solid 
supports in the form of a microarray or beads (see above). 
The binding profile is preferably analysed by a detector 
capable of detecting the label used to label the expression 
products. The determination of the presence or risk of 
breast cancer can be made by comparing the binding profile 
of the sample with that of a control e.g. standard 
expression profiles. 

In all of the aspects described above, it is preferred to 
use binding members capable of specifically binding (and, 
in the case of nucleic acid primers, amplifying) expression 
products of all 20 genetic identifiers. This is because 
the expression levels of all 20 genes make up the 
expression profile specific for the cells under test. The 
classification of the expression profile is more reliable 
the greater the number of gene expression levels tested. Thus, 
preferably expression levels of more than 5 genes selected 
from Table 2 are assessed, more preferably, more than 10, 
even more preferably, more than 15 and most preferably all 
20 genes. 

The genetic identifier (Table 2) mentioned above is 
particularly suitable for spotted cDNA microarray 
technology where the microarray (or other similar 
technology) has been created specifically for this purpose. 
However, the present inventors have appreciated that the 
present invention may be modified so that commercially 
available genechips may be used, rather than going to the 
trouble of creating one specifically containing the genes 
identified in Table 2. With this in mind, the inventors 
have identified a further genetic identifier (Table 5a or 
5b) which, although it may be utilized using the microarray 
technology described above, may also be used on 
commercially available genechips, e.g. Affymetrix U133A 
Genechips. 

Thus, the aspects of the invention described above may also 
be carried out using the geneset of Table 4a or 4b instead 
of that of Table 2 and, in addition, these may be used 
either on commercially available genechips such as 
Affymetrix U133A Genechips, or using microarray technology 
described above. 

The present inventors have also identified a further set of 
genes (Table 5a) which may be used to classify a breast 
tumour on the basis of the Estrogen Receptor (ER) status. 
This is clinically important as ER+ tumours can be treated 
with hormonal therapies (e.g. tamoxifen) and ER- tumours are 
typically more aggressive and refractory to treatment. 

Likewise, the present inventors have also identified a 
further set of genes (Table 5b) which may be used to 
classify a breast tumour on the basis of the ERBB2+ status. 
Knowing the ERBB2+ status of a breast tumour is also 
clinically important as ERBB2+ tumours are typically highly 
aggressive and carry a poor clinical prognosis. ERBB2+ 
tumours are also candidates for treatment with Herceptin (an 
anti-cancer drug). 

The genesets provided in Tables 5a and 5b were determined 
by generating expression profiles for a set of breast 
tumour samples using Affymetrix U133A Genechips. A series 
of statistical algorithms was used to identify a set of 
genes that were differentially expressed in ER+ vs ER- 
samples as well as ERBB2+ vs ERBB2- samples. Accordingly, 
the present invention further provides genesets which may 
be used in methods of classifying breast tumours according 
to ER and ERBB2 status. 

Thus, in a further aspect of the present invention, there 
is provided a method of classifying a breast tumour 
according to its ER and/or ERBB2 status comprising 

a) obtaining expression products from the tumour 
cells; 

b) contacting said expression products with a 
plurality of binding members capable of specifically 
binding to the expression products of a plurality of genes 
selected from Table 5; and 

c) classifying the tumour cell on the basis of ER 
and/or ERBB2 status based on the binding profile of the 
expression products from the sample and the binding 
members. 

As with the first aspect of the present invention, the 
plurality of binding members are preferably nucleic acid 
sequences and more preferably nucleic acid sequences fixed 
to a solid support, for example as a nucleic acid 
microarray. The nucleic acid sequences may be 
oligonucleotide probes or cDNA sequences. 

The tumour cell may be classified according to its ER 
and/or ERBB2 status on the basis of the expression of the 
genes identified in Table 5. Table 5 identifies each gene 
or down regulated (-) in an 
information, it is possible 
tumour cell under test is ER- 



As with all aspects of the present invention, the plurality 
of genes selected from the determined genesets (Tables 2-7 
with the exception of Table 6b) may vary in actual number. 
It is preferable to use at least 5 genes, more preferably at 
least 10 genes in order to carry out the invention. Of 
course, the known microarray and genechip technologies allow 
large numbers of binding members to be utilized. Therefore, 
the more preferred method would be to use binding members 
representing all of the genes in each geneset. However, the 
skilled person will appreciate that a proportion of these 
genes may be omitted and the method still carried out in a 
reliable and statistically accurate fashion. In most cases, 
it would be preferable to use binding members representing 
at least 70%, 80% or 90% of the genes in each respective 
geneset. 

In a further aspect of the invention, there is provided a 
method of classifying a breast tumour cell as to its 
molecular subtype comprising 

a) obtaining expression products from the tumour 
cells; 

b) contacting said expression products with a 
plurality of binding members capable of specifically binding 
to the expression products of a plurality of genes selected 
from Table 6; and 

c) classifying the tumour cell with regard to its 
molecular subtype based on the binding profile of the 
expression products from the tumour cell and the binding 
members. 

WO 03/070979 

PCT/GB03/00755 

The molecular subtypes are preferably Luminal, ERBB2, 
Basal, ER-type II and Normal/normal-like. These sub-types 
are defined in the following text. 

In practice, the expression profile of the tumour sample to 
be classified is first determined using the genesets described 
in Table 6 (whether Table 6a or 6b is used depends on the type 
of classification algorithm used). Second, the expression 
profile is compared to a database of "reference" (control) 
profiles, where each "reference" profile corresponds to the 
"average" tumour belonging to that particular molecular 
type. In this case, rather than just having normal and 
tumour, or ER+ and ER-, the "reference" profiles will 
correspond to five distinct subtypes. Third, by using a 
suitable classification algorithm, the unknown tumour 
sample can be assigned to the specific subtype for which 
the expression profile finds a good reference match. 
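
By way of illustration only, the matching of an unknown profile to a set of "reference" profiles might be sketched as follows. The text above does not name the classification algorithm; Pearson correlation is assumed here as the similarity measure, and the function names, subtype labels used as keys, and the four-gene values are hypothetical and are not taken from Table 6.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def classify_subtype(sample, references):
    """Return (subtype, correlation) for the best-matching reference profile."""
    scores = {name: pearson(sample, profile) for name, profile in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical log-ratio values for four genes (assumption, not patent data).
references = {
    "Luminal": [2.0, 1.5, -1.0, -2.0],
    "ERBB2": [-1.0, 2.0, 1.5, -0.5],
    "Basal": [-2.0, -1.5, 1.0, 2.0],
}
sample = [1.8, 1.2, -0.7, -1.9]
subtype, score = classify_subtype(sample, references)
print(subtype, round(score, 3))
```

Any other similarity or distance measure could be substituted for the correlation without changing the overall scheme of one "reference" profile per subtype.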

Where the plurality of binding members are selected as 
being capable of binding to the expression products of a 
plurality of genes from Table 6a, the number of binding 
members used will govern the reliability of the test. In 
other words, it is not necessary to use binding members 
capable of specifically and independently binding to all genes 
identified in Table 6a, but the more binding members used, 
the better the test. Therefore, by plurality it is meant 
preferably at least 50%, more preferably at least 70% and 
even more preferably at least 90% of the genes as mentioned 
above. 

In a still further aspect of the invention, there is 
provided a method of further sub-classifying a breast 
tumour cell as either luminal A or luminal D subtype 
comprising 

a) obtaining expression products from the tumour 
cells; 

b) contacting said expression products with a 
plurality of binding members capable of specifically 
binding to the expression products of a plurality of genes 
selected from Table 7; and 

c) classifying the tumour cell with regard to its 
molecular subtype based on the binding profile of the 
expression products from the tumour cell and the binding 
members. 

Preferably, the method is carried out on expression 
products obtained from a breast tumour cell which has 
already been classified as "luminal", e.g. using the 
genetic identifier of Table 6a or 6b. 

With regard to the geneset provided in Table 6b, it is 
preferable that all of the genes in the geneset are used 
for classification. A reduction in the number of genes 
will reduce the likelihood of a reliable result. This 
is because this geneset is selected using the genetic 
algorithm approach. 

The inventors have provided a number of genetic identifiers 
(Tables 2 to 7) which can be used to diagnose and/or 
predict risk of breast cancer and, further, can be used to 
classify the type of breast cancer, particularly for women 
of Asian descent. 

The provision of these genetic identifiers allows 
diagnostic tools, e.g. nucleic acid microarrays, to be 
custom made and used to predict, diagnose or subtype 
tumours. Further, such diagnostic tools may be used in 
conjunction with a computer which is programmed to 
determine the expression profile obtained using the 
diagnostic tool (e.g. microarray) and compare it to a 
"standard" expression profile characteristic of normal vs 
tumour and/or molecular subtypes depending on the 
particular genetic identifier used. In doing so, the 
computer not only provides the user with information which 
may be used to diagnose the presence or type of a tumour in a 
patient, but at the same time, the computer obtains a 
further expression profile by which to determine the 
"standard" expression profile and so can update its own 
database. 

Thus, the invention allows, for the first time, specialized 
chips (microarrays) to be made containing probes 
corresponding to the genesets identified in Tables 2 to 7. 
The exact physical structure of the array may vary and 
range from oligonucleotide probes attached to a 2- 
dimensional solid substrate to free-floating probes which 
have been individually "tagged" with a unique label, e.g. 
"bar code". 

A database corresponding to the various biological 
classifications (e.g. normal, tumour, molecular subtype 
etc.) may be created which will consist of the expression 
profiles of various breast tissues as determined by the 
specialized microarrays. The database may then be 
processed and analysed such that it will eventually contain 
(i) the numerical data corresponding to each expression 
profile in the database, (ii) a "standard" profile which 
functions as the canonical profile for that particular 
classification; and (iii) data representing the observed 
statistical variation of the individual profiles to the 
"standard" profile. 

In practice, to evaluate a patient's sample, the expression 
products of that patient's breast cells (obtained via 
excisional biopsy or fine needle aspirate) will first be 
isolated, and the expression profile of that cell 
determined using the specialized microarray. To classify 
the patient's sample, the expression profile of the 
patient's sample will be queried against the database 
described above. Querying can be done in a direct or 
indirect manner. The "direct" manner is where the 
patient's expression profile is directly compared to other 
individual expression profiles in the database to 
determine which profile (and hence which classification) 
delivers the best match. Alternatively, the querying may 
be done more "indirectly", for example, the patient 
expression profile could be compared against simply the 
"standard" profile in the database. The advantage of the 
indirect approach is that the "standard" profiles, because 
they represent the aggregate of many individual profiles, 
will be much less data intensive and may be stored on a 
relatively inexpensive computer system which may then form 
part of the kit (i.e. in association with the microarrays) 
in accordance with the present invention. In the direct 
approach, it is likely that the data carrier will be of a 
much larger scale (e.g. a computer server) as many 
individual profiles will have to be stored. 
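
The difference between the "direct" and "indirect" querying approaches can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance is assumed as the matching criterion (the text does not specify one), the "standard" profile is assumed to be the gene-wise mean, and the database contents and function names are hypothetical.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def standard_profile(profiles):
    """Gene-wise mean of a list of equal-length expression profiles."""
    return [sum(values) / len(values) for values in zip(*profiles)]


def query_direct(sample, database):
    """Label of the single stored profile closest to the sample."""
    candidates = (
        (dist(sample, profile), label)
        for label, profiles in database.items()
        for profile in profiles
    )
    return min(candidates)[1]


def query_indirect(sample, database):
    """Label of the class whose "standard" profile is closest to the sample."""
    standards = {label: standard_profile(ps) for label, ps in database.items()}
    return min(standards, key=lambda label: dist(sample, standards[label]))


# Hypothetical three-gene database (assumption, not patent data).
database = {
    "normal": [[0.1, -0.2, 0.0], [-0.1, 0.1, 0.2]],
    "tumour": [[2.0, 1.5, -1.8], [1.6, 2.1, -2.2]],
}
sample = [1.7, 1.9, -1.9]
print(query_direct(sample, database), query_indirect(sample, database))
```

The direct query must hold every individual profile, whereas the indirect query needs only one vector per classification, which is the storage advantage described above.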

By comparing the patient expression profile to the standard 
profile (indirect approach) and the pre-determined 
statistical variation in the population, it will also be 
possible to deliver a "confidence value" as to how closely 
the patient expression profile matches the "standard" 
canonical profile. This value will provide the clinician 
with valuable information on the trustworthiness of the 
classification, and, for example, whether or not the 
analysis should be repeated. 
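
One possible way of deriving such a "confidence value" is sketched below. The text does not define the statistic, so the definition used here is an assumption made purely for illustration: the fraction of stored class members that lie at least as far from the "standard" profile as the patient sample does. The function name and example values are hypothetical.

```python
from math import dist


def confidence_value(sample, members):
    """Fraction of stored class members lying at least as far from the
    class "standard" (gene-wise mean) profile as the sample does.
    Values near 1 indicate a typical member; values near 0 an atypical one."""
    standard = [sum(values) / len(values) for values in zip(*members)]
    sample_distance = dist(sample, standard)
    member_distances = [dist(member, standard) for member in members]
    return sum(d >= sample_distance for d in member_distances) / len(members)


# Hypothetical two-gene profiles for one classification (assumption).
members = [[2.0, 1.0], [2.2, 1.2], [1.8, 0.8], [2.4, 0.6]]
close_sample = [2.1, 1.0]
distant_sample = [4.0, 3.0]
print(confidence_value(close_sample, members))
print(confidence_value(distant_sample, members))
```

A low value for a sample would suggest to the clinician that the classification is weakly supported and that the analysis may merit repetition.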

As mentioned above, it is also possible to store the 
patient expression profiles on the database, and these may 
be used at any time to update the database. 
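
If the "standard" profile is taken to be a gene-wise mean (an assumption; the text does not fix its definition), the database can be updated with a new patient profile without re-reading the stored profiles. The following sketch uses a hypothetical function name and hypothetical values.

```python
def update_standard(standard, n_profiles, new_profile):
    """Fold one new profile into a gene-wise mean that was built from
    n_profiles earlier profiles; returns the new mean and the new count."""
    updated = [
        mean + (value - mean) / (n_profiles + 1)
        for mean, value in zip(standard, new_profile)
    ]
    return updated, n_profiles + 1


# Hypothetical two-gene "standard" built from three earlier profiles.
standard, count = update_standard([1.0, 2.0], 3, [5.0, -2.0])
print(standard, count)
```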

Aspects and embodiments of the present invention will now 
be illustrated, by way of example, with reference to the 
accompanying figures. Further aspects and embodiments will 
be apparent to those skilled in the art. All documents 
mentioned in this text are incorporated herein by reference. 

Figure 1: Unsupervised Partitioning of Normal and Tumour 
Breast Samples. Individual expression profiles were 
subjected to standard data selection filters (see text), 
and the resultant data matrix, comprising approximately 800 
array targets, was sorted using hierarchical clustering. 
Normal samples ('xxxN') are underlined, while tumour 
samples ('xxxT') are not. Numbers represent the NCC Tissue 
Repository numbers associated with each sample. The 
dendrogram branches illustrate the extent of similarity 
between the biological samples. Normal and Tumour samples 
segregate independently, but only at secondary levels of 
the dendrogram. Minor variations on the data filters used to 
select this data set also yielded highly similar dendrograms 
(P. Tan, unpublished observations). 

Figure 2: Improvement of Normal and Tumour Sample 
Partitioning Using Combined Outlier Genesets (COG). (A) 
Independent outlier genesets for normal (left) and tumour 
(right) samples were defined. Each clustergram consists of 
a matrix of array targets (rows) by biological samples 
(columns), and light grey represents upregulation, while 
dark grey represents downregulation (see Materials and 
Methods for selection criteria). The outlier geneset for 
normal samples consists of 60 genes, while the outlier 
geneset for tumour samples consists of 75 genes. Specific 
normal and tumour samples used in the establishment of the 
outlier genesets are listed below each clustergram. 
Underlined sample numbers indicate reciprocal 
hybridizations, where the tumour/normal sample was labelled 
using Cy5 and the reference sample Cy3. (B) Partitioning of 
normal and tumour samples using the COG. The 108 unique 
array targets comprising the COG were used to segregate the 
tumour and normal samples from Figure 1 using standard 
hierarchical clustering. In contrast to Figure 1, division 
of the normal (xxxN) and tumour (xxxT) samples is now 
observed as a primary class division, with 2 
misclassifications. 






Figure 3: Partitioning of Normal and Tumour Samples using a 
Minimal 20-Element Genetic Identifier. The 20 array targets 
from the COG (Table 2) that were most highly correlated to 
the tumour/normal class distinction were used to segregate 
(A) the training set from Figures 1 and 2b, and (B) a naive 
test set of 10 normals and 11 tumours. In both cases, 
accurate segregation of normal and tumour samples at the 
level of the primary class division can be observed. 

Figure 4: Comparison of expression profile variation in 
normal and tumour samples. Independent normal and tumour 
datasets were established using the combined samples of 
Figure 3a and 3b (total = 48 samples). Using PCA, the 
entire gene expression matrix of approximately 8000 array 
targets in these datasets was reduced to basic principal 
components. The extent of variance of each component 
normalized to the 1st component (normalized eigenvalue) is 
depicted on the y-axis, and the principal component number 
on the x-axis, beginning with the 2nd component (since the 
first component of each set is 1). To observe the rate of 
'decay' of information, the components for each dataset are 
depicted in decreasing order of variance. Normal samples 
consistently exhibit a lower information decay rate across 
their components compared with tumours. 

Figure 5: Gene expression patterns of 62 samples including 
56 carcinomas and 6 normal tissues, analyzed by 
hierarchical clustering using different gene sets. Samples 
were divided into 6 subtypes based on differences in gene 
expression (legend), and are: Luminal (S1); ERBB2+/ER+ 
(S2); ERBB2+/ER- (S3); Basal-like (S4); ER negative subtype 
II (S5); and Normal/Normal-like (S6). 
(a) Unsupervised hierarchical clustering using a dataset 
of 1796 genes. The gray underline indicates a cluster which 
contains a mixture of Luminal and ERBB2+/ER+ samples. (b) 
Semi-supervised hierarchical clustering using the 'common 
intrinsic gene set' (CIS, 292 genes). (c) The full cluster 
diagram using the CIS. Shaded bars to the right of the 
clustergram represent gene clusters A-E (Table 3), and are 
(A) Luminal epithelial genes with ER. (B) 'Novel' genes. 
(C) Basal epithelial genes. (D) Normal breast-like genes. 
(E) ERBB2-related genes. 

Figure 6: (a)-(d) Representative Examples of DCIS Samples 
Used in this Study. Two samples are shown, (a)/(b) and 
(c)/(d). The DCIS status of each sample was confirmed both 
by examination of paraffin H & E sections of samples ((a) 
and (c), HE), as well as frozen cryosections ((b) and (d), 
FS) of the actual sample that was processed for expression 
profiling. (e) 'Distinct Origins' and 'Evolutionary' 
Theories of Breast Cancer Development. The 'Distinct 
Origins' hypothesis proposes that different molecular 
subtypes of cancer arise via different tumorigenic 
pathways, and thus constitute distinct biological entities. 
The 'Evolutionary' hypothesis proposes that the different 
molecular subtypes arise as a result of a single (or a few) 
cancer classes undergoing different stages of phenotypic 
development. One cannot distinguish between the two 
hypotheses by only studying advanced invasive cancers 
obtained at a single point in time. 

Figure 7: DCIS samples express the hallmark genes of 
advanced carcinoma subtypes. DCIS samples are shown as dark 
vertical lines. Based upon the CIS geneset, six out of 
twelve DCIS samples cluster within the ERBB2+ groups (S2 
and S3), 5 samples in the Luminal group, and one sample was 
in the normal-like group. Shaded bars to the right of the 
clustergram represent the same gene clusters as shown in 
Figure 5. (A) Luminal epithelial genes with ER. (B) Basal 
epithelial genes. (C) Normal breast-like genes. (D) ERBB2. 

Figure 8: Summary of pathway-specific and overlapping genes 
for the Luminal A and ERBB2+ tumor subtypes. 'U' indicates 
upregulated genes and 'D' indicates downregulated genes. 
For example, there are 245 genes upregulated and 705 genes 
downregulated during the normal/DCIS (Luminal) transition. 
Numbers in bold are overlapping genes between two gene 
sets. a) Results based upon a false-discovery rate (FDR) of 
5%. b) Results when only the top 100 most significantly 
regulated unique genes are compared. 

Figure 9: a) Discovery of a Luminal D subtype. A series of 
previously homogenous Luminal A tumors (identified as 
subtype S1 by the CIS in Figures 5 and 7) were regrouped by 
hierarchical clustering based upon 'proliferation cluster' 
linked genes. Two broad groups are observed, which exhibit 
low (Luminal A) and high (Luminal D) levels of expression 
of the 'proliferation cluster' respectively. b) High 
levels of the 36-gene 'proliferation cluster' are also 
observed in other aggressive tumor types. Luminal D (15 out 
of 17 samples, indicated as dark bars under sample 
numbers), Basal (ER-) and ERBB2+ve samples all strongly 
express the 36-gene 'proliferation cluster' (bar below 
clustergram, left branch), while Luminal A (all but one 
boundary case), normal-like and normal samples show low levels 
of expression. Light grey/white indicates upregulation, 
while dark grey/black indicates downregulation. 

Materials and Methods 

Breast Tissue Samples 

Primary breast tissues were obtained from the NCC Tissue 
Repository, after appropriate approvals had been obtained 
from the institution's Repository and Ethics Committees. In 
general, all tumour and matched normal tissues were 
simultaneously harvested during surgical excision of the 
tumour. After surgical excision, the samples were 
immediately grossly dissected in the operating theatre, and 
flash-frozen in liquid N2. Histological confirmation of 
tumour status was subsequently provided by the Dept of 
Pathology at Singapore General Hospital. Samples were 
stored in liquid N2 until processing was performed. With 
the exception of 1 tumour and matched normal sample pair 
that came from an Indian patient, all other samples were 
derived from Chinese patients. Confirmation of the DCIS 
status of tissue samples used in this report was achieved 
both by conventional H & E staining of archival samples, as 
well as direct cryosections of the actual sample that was 
processed for expression profiling. 

Sample Preparation and Microarray Hybridization 

For hybridisations involving Affymetrix Genechips, RNA was 
extracted from tissues using Trizol reagent, purified 
through a Qiagen Spin Column, and processed for Affymetrix 
Genechip hybridization according to the manufacturer's 
instructions. For each spotted cDNA microarray 
hybridization 2-3 µg of total RNA was used following 
single-round linear amplification (Wang et al., 2000). All 
breast samples for the spotted cDNA microarray 
hybridisations were compared against a standard 
commercially available mRNA reference pool (Stratagene) 
that had been similarly amplified. cDNA microarrays were 
fabricated following standard procedures (DeRisi et al., 
1997), using cDNA clones obtained from various commercial 
vendors (Incyte, Research Genetics). Except where 
mentioned, samples were fluorescently labelled using Cy3 
dye, while the reference was labelled with Cy5. 
Hybridizations were performed using Affymetrix U133A 
Genechips. After hybridization, microarray images were 
captured using a CCD-based microarray scanner (Applied 
Precision, Inc). 

Data Processing and Analysis 

For spotted cDNA microarray data, fluorescence intensities 
corresponding to individual microarrays were uploaded into 
a centralized Oracle 8i database. Establishment of various 
data sets and gene retrievals were performed using standard 
SQL queries. Hierarchical clustering was performed using 
the program Xcluster (Stanford) and visualized using the 
program Treeview (Eisen et al., 1998). To identify outlier 
genes in tumour and normal datasets, array elements were 
chosen which consistently exhibited greater than 3-fold 
regulation across 90% of all arrays for the normal dataset 
and 80% of all arrays for the tumour dataset. Correlation 
analysis was performed using the similarity metric concept 
employed in Golub et al. (1999). Briefly, the similarity 
metrics corresponding to the normal/tumour class 
distinction were calculated for each gene, and the genes 
then sorted based on descending order of their similarity 
values. After being sorted by their positive and negative 
correlation to the class distinction, the top 10 genes from 
each class were chosen for subsequent cluster analysis. 
Principal Component Analysis (PCA) was performed by 
linearly transforming the gene expression matrix, which 
consists of a number of correlated variables, into a 
'smaller' number of uncorrelated variables (principal 
components). For datasets in linear subspace, the data can 
be 'compressed' in this manner without losing too much 
information while simplifying the data representation. The 
first principal component accounts for maximum variability 
in the data, and each succeeding component accounts for 
parts of the remaining variability. 
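
The correlation-based ranking described in the preceding paragraph can be illustrated with the signal-to-noise statistic of Golub et al. (1999), P(g) = (mean1 - mean2) / (sd1 + sd2). The sketch below is a hedged illustration only: the use of the sample standard deviation, the function names, the gene labels and the expression values are assumptions, and the code is not the inventors' implementation.

```python
from statistics import mean, stdev


def signal_to_noise(tumour_values, normal_values):
    """Golub-style class-distinction score: difference of class means
    divided by the sum of class standard deviations."""
    return (mean(tumour_values) - mean(normal_values)) / (
        stdev(tumour_values) + stdev(normal_values)
    )


def select_identifier(tumour, normal, per_class=10):
    """tumour and normal map gene -> list of expression values.
    Returns the most positively and most negatively correlated genes."""
    scores = {gene: signal_to_noise(tumour[gene], normal[gene]) for gene in tumour}
    ranked = sorted(scores, key=scores.get, reverse=True)
    up = [gene for gene in ranked if scores[gene] > 0][:per_class]
    down = [gene for gene in reversed(ranked) if scores[gene] < 0][:per_class]
    return up, down


# Hypothetical log-ratio values for three genes (assumption, not patent data).
tumour = {
    "gene_a": [3.0, 3.2, 2.8],
    "gene_b": [0.1, -0.1, 0.0],
    "gene_c": [-2.0, -2.2, -1.8],
}
normal = {
    "gene_a": [0.0, 0.2, -0.2],
    "gene_b": [0.0, 0.1, -0.1],
    "gene_c": [0.1, -0.1, 0.0],
}
up, down = select_identifier(tumour, normal, per_class=1)
print(up, down)
```

With per_class=10 the same call would return the ten most positively and ten most negatively correlated genes, mirroring the selection of the top 10 genes from each class.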

For Affymetrix Genechips, raw Genechip scans were quality 
controlled using a commercially available software program 
(Genedata Refiner) and deposited into a central data 
storage facility. The expression data was filtered by 
removing genes whose expression was absent in all samples 
(i.e. 'A' calls), subjected to a log2 transformation, and 
normalized by median centering all remaining genes and 
samples. Data analysis was then performed either using the 
Genedata Expressionist software analysis package or using 
conventional spreadsheet applications. The unsupervised 
dataset of 1796 genes used in Figure 1 was established by 
selecting genes exhibiting a standard deviation (SD) of >1 
across all well-measured samples. Average-linkage 
hierarchical clustering was applied by using the CLUSTER 
program and the results were displayed by using TREEVIEW 
(9). Significance analysis of microarrays (SAM) was 
performed essentially as described in Tusher et al., (2001) 
(10), using a fold-change cutoff of 2 and an appropriate 
delta value to cap the gene false-discovery rate (FDR) at 
5% (0.05). 
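
The filtering and normalization steps for the Genechip data may be sketched as follows. This is an illustration under stated assumptions rather than a description of the commercial software actually used: genes are centred before samples (the order is not stated in the text), raw intensities are assumed to be positive, and the function names, gene labels and values are hypothetical.

```python
from math import log2
from statistics import median, stdev


def preprocess(intensities, present):
    """intensities: gene -> raw positive signals, one per sample.
    present: gene -> booleans, False where the detection call is 'absent'."""
    # Drop genes that are absent in every sample, then log2-transform.
    kept = {
        gene: [log2(v) for v in values]
        for gene, values in intensities.items()
        if any(present[gene])
    }
    # Median-centre each gene (row).
    for gene, values in kept.items():
        centre = median(values)
        kept[gene] = [v - centre for v in values]
    # Median-centre each sample (column).
    n_samples = len(next(iter(kept.values())))
    for j in range(n_samples):
        centre = median(values[j] for values in kept.values())
        for values in kept.values():
            values[j] -= centre
    return kept


def variable_genes(data, sd_cutoff=1.0):
    """Genes whose standard deviation across samples exceeds the cutoff."""
    return [gene for gene, values in data.items() if stdev(values) > sd_cutoff]


# Hypothetical raw signals for four genes across three samples (assumption).
intensities = {
    "g1": [100.0, 800.0, 6400.0],
    "g2": [200.0, 200.0, 200.0],
    "g3": [300.0, 300.0, 300.0],
    "g4": [20.0, 25.0, 22.0],
}
present = {
    "g1": [True, True, True],
    "g2": [True, True, False],
    "g3": [True, True, True],
    "g4": [False, False, False],
}
data = preprocess(intensities, present)
print(variable_genes(data))
```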

Creation of a Common Intrinsic Geneset (CIS) 

Genes common to both the U133A Genechip Probe Set and the 
'intrinsic' dataset as defined in Perou et al., (2000) were 
selected in the following manner: out of the original 
'intrinsic' set consisting of 456 cDNA clones, 428 could be 
assigned to a specific Unigene cluster using the Stanford 
Source database (Unigene Build 156). This number was then 
reduced to 403 genes after the removal of duplicate genes. 
The U133A Genechip probe set was then queried using this 
list, yielding 292 matches, or 72.5% of the original 
'intrinsic' set (counting only unique genes). 
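
The counts quoted above can be checked with a short calculation. The figures are those stated in the text; the variable names and step descriptions are illustrative only.

```python
# Successive counts reported for the creation of the CIS.
steps = [
    ("cDNA clones in the 'intrinsic' set", 456),
    ("clones assigned to a Unigene cluster", 428),
    ("unique genes after removal of duplicates", 403),
    ("unique genes matched on the U133A Genechip", 292),
]

unassigned_clones = steps[0][1] - steps[1][1]
duplicates_removed = steps[1][1] - steps[2][1]
coverage_percent = 100 * steps[3][1] / steps[2][1]

print(f"{unassigned_clones} clones could not be assigned to a cluster")
print(f"{duplicates_removed} duplicate entries were removed")
print(f"{coverage_percent:.1f}% of unique 'intrinsic' genes are on the U133A")
```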



Results 

Partitioning of Normal and Tumour Breast Specimens Using 
Unsupervised Clustering 

The inventors used cDNA microarrays of approximately 13,000 
elements to generate gene expression profiles for a set of 
26 grossly-dissected breast tissue specimens (14 tumour, 12 
normal) obtained from patients of primarily Chinese 
ethnicity (see Materials and Methods). After hybridization 
and scanning, approximately 8,000 array elements were found 
to exhibit fluorescence signals significantly above 
background levels, and these elements were used for 
subsequent analysis. Initially, the inventors found that an 
unsupervised clustering methodology based upon a number of 
commonly used data filters (e.g. selecting genes exhibiting 
at least 3-fold regulation across at least 4-5 arrays) (see 
Perou et al., 1999, Wang et al., 2000) resulted in the array 
clustergram shown in Figure 1. Specifically, the sample set 
segregated into two broad groups, with each group 
consisting of a mixture of tumour and normal specimens. 
However, within each group, the inventors found that the 
tumour and normal tissues effectively segregated into 
fairly independent sub-branches. The observation that 
tumour and normal tissues can be segregated using 
unsupervised clustering suggests that specific genes may 
exist that can effectively distinguish between a tumour and 
normal sample. However, in the context of a large 
unsupervised data set, it is also clear that these genes 
are only capable of distinguishing between normal and 
tumour samples in sub-branches of the correlation 
dendrogram, rather than at the level of a primary class 
division. Similar findings have also been reported in other 
breast cancer expression profiling projects (Perou et al., 
2000), suggesting that at the level of the global 
transcriptome, the expression levels of other genes may 
'supersede' the information encoded by genes involved in 
the tumour/normal class distinction (see discussion). 

Use of Outlier Genesets to Classify Normal and Tumour 
Samples 

One of the main objectives of the inventors' research is to 
identify genes or gene subsets that are of significant 
diagnostic or therapeutic potential. To be of clinical 
utility, it will be necessary to identify a class of genes 
that can accurately predict if an unknown breast tissue 
sample is normal or malignant at the level of the primary, 
rather than secondary, class division. To identify these 
genesets, or 'genetic identifiers', a number of supervised 
learning strategies, such as neighborhood analysis and 
artificial neural networks, have been previously described 
(Golub et al., 1999, Khan et al., 2001). However, the 
inventors used a slightly different strategy to identify 
these elements that focuses on the use of highly 
reproducible outlier genes. In this methodology, samples 
belonging to different classes are initially established as 
independent datasets. Within each group, genes that are 
consistently up or downregulated ('outliers') across all or 
close to all arrays are then identified. These separate 
'outlier groups' are then combined, and the ability of the 
combined set of genes to distinguish between the two 
classes is then assessed using standard clustering 
methodologies. 
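
A minimal sketch of the outlier selection step is given below, using the thresholds stated in the Materials and Methods (greater than 3-fold regulation across 90% of arrays for the normal dataset and 80% for the tumour dataset). It assumes that the data are held as log2(sample/reference) ratios and that "consistently" means the same direction of regulation on the required fraction of arrays; the function name, gene labels and values are hypothetical.

```python
from math import log2


def outlier_genes(log_ratios, fold=3.0, min_fraction=0.9):
    """log_ratios: gene -> list of log2(sample/reference), one per array.
    Returns gene -> 'up' or 'down' for genes regulated by more than `fold`
    in the same direction on at least `min_fraction` of the arrays."""
    threshold = log2(fold)
    selected = {}
    for gene, ratios in log_ratios.items():
        fraction_up = sum(r > threshold for r in ratios) / len(ratios)
        fraction_down = sum(r < -threshold for r in ratios) / len(ratios)
        if fraction_up >= min_fraction:
            selected[gene] = "up"
        elif fraction_down >= min_fraction:
            selected[gene] = "down"
    return selected


# Hypothetical ratios across ten arrays of one class (assumption).
log_ratios = {
    "gene_up": [2.0] * 9 + [0.5],
    "gene_down": [-2.5] * 10,
    "gene_mixed": [2.0] * 5 + [-2.0] * 5,
}
outliers = outlier_genes(log_ratios, fold=3.0, min_fraction=0.9)
print(outliers)
```

Running the function separately on the normal dataset (min_fraction=0.9) and the tumour dataset (min_fraction=0.8) and combining the two results would give a combined outlier set of the kind described.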

The inventors first established outlier gene subsets for 
both the normal and tumour populations. To avoid biases 
that might be introduced by fluorophore labelling, they 
also included in each group 5 'reciprocal' expression 
profiles in which the sample and reference RNA population 
were inversely labelled. This analysis identified 60 highly 
reproducible 'outlier' genes for the normal group and 75 
genes for the tumour group that were either consistently up 
or down-regulated across all or close to all arrays (Figure 
2). A cross-comparison of the normal and tumour outlier 
sets revealed a number of genes in common between both sets 
(Table 1), leading to a final combined outlier geneset 
(referred to as the COG) of 108 genes. 
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The COG was then used to cluster the 26 breast tissue 
samples. In contrast to the large-scale clustergram 
observed in Figure 1, the inventors found that clustering 
using the genes found in the COG effectively segregated the 
majority of tumour and normal samples into two principal 
branches, with 2 mis-classifications (Figure 2a) . 
Specifically, 1 normal sample and 1 tumour sample were mis- 
assigned, and in the former case a guality check of the 
gene expression values revealed that this sample was 
associated with a number of so-called ^missing' values 
(grey bars in clustergram) , which may have led to this 
sample being mis-classified. Nevertheless, the majority of 
samples were correctly grouped, suggesting that for certain 
datasets, ^outlier analysis' may serve as a simple and 
effective method to identify discriminating genes between 
distinct classes. 

Definition of a Minimal Genetic Identifier for the Normal 
vs Tumour Class Distinction in Breast Tissues 

Despite representing a dramatic reduction in the number of 
genes from the initial data set (8,000 to 108), the number 
of elements contained in the COG is still too large to be 
feasibly included in its entirety as part of a potential 
diagnostic assay. Ideally, a diagnostic geneset should 
i) consist of a minimal number of elements, ii) be of high 
predictive accuracy, and iii) represent a mixture of genes 
that are positively and negatively correlated to the class 
distinction in question. To further reduce the combined 
outlier geneset to its most informative elements, the 
inventors used correlation analysis to identify and rank 
genes in the COG that are most highly correlated to the 
tumour/normal class distinction (see Materials and 
Methods). The 10 most highly positively and the 10 most 
highly negatively correlated genes were then assessed for 
their ability to accurately classify the breast samples. 
The inventors found that this minimal set of 20 genes, 
referred to as a 'genetic identifier', accurately classified 
all of the normal and tumour samples (Figure 2b and Table 
2). The genes that make up the 'genetic identifier' 
represent a mixture of genes known to be involved in breast 
and tumour biology, as well as other genes whose role in 
tumour formation has not as yet been described (see 
discussion). 
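
A minimal sketch of such a correlation-based ranking is given 
below for illustration. It assumes a Pearson correlation between 
each gene's expression values and a 0/1 class label; the exact 
correlation measure used by the inventors is described in the 
Materials and Methods, and the function names and toy values here 
are hypothetical. 

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant gene carries no class information
    return cov / math.sqrt(vx * vy)


def rank_by_class_correlation(expression, labels, k):
    """Return the k most positively and k most negatively correlated genes.

    expression: dict mapping gene name -> expression values per sample.
    labels: 1 for tumour, 0 for normal (assumed encoding).
    """
    scored = sorted((pearson(values, labels), gene)
                    for gene, values in expression.items())
    negative = [gene for _, gene in scored[:k]]
    positive = [gene for _, gene in reversed(scored[-k:])]
    return positive, negative


# Toy data (hypothetical): 3 normal samples followed by 3 tumour samples.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "geneA": [0.1, 0.2, 0.1, 2.0, 2.1, 1.9],
    "geneB": [2.0, 1.8, 2.2, 0.2, 0.1, 0.3],
    "geneC": [1.0, 1.2, 0.9, 1.1, 1.0, 1.2],
    "geneD": [0.5, 0.4, 0.6, 0.9, 1.5, 1.2],
}
positive, negative = rank_by_class_correlation(expression, labels, k=1)
```

With k = 10 the same call would yield the two halves of a 20-gene 
set of the kind described above. 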

Predictive Capacity of the 20-gene 'Genetic Identifier' 

All analyses done up to this point were performed on the 
same 'training' set of 26 breast samples, and thus the 
predictive power of the 20-element geneset has not been 
addressed. To assess the robustness of this 'genetic 
identifier', the inventors followed the strategy of Golub 
et al. (1999) and tested the ability of the minimal 
predictor to classify a naive 'test set' of another 22 
breast samples, of which 12 samples were tumours and the 
remaining 10 were non-malignant. As with the training set, 
they found that the 20-gene genetic identifier was also 
able to classify the naive set with complete accuracy 
(Figure 3b). Thus, it appears that the ability of the 
'genetic identifier' to predict if a given breast sample is 
normal or malignant is not confined to the training set 
from which it was generated. Instead, the 
number of elements in this geneset, although minimal, may 
be of sufficient sensitivity and informative power to give 
it predictive value. 

Assessing the Global Level of Variation between Normal and 
Tumour Breast Tissues 

Breast tumours are clinically characterized by wide 
variations in clinical course, disease aggressiveness, and 
response to medication. Consistent with these wide 
phenotypic variations has been the finding that individual 
breast tumours can exhibit large variations in their global 
gene expression patterns (Perou et al., 2000). One common 
hypothesis to explain these wide variations is to consider 
them as the consequences of multiple independent pathways 
of tumourigenesis. However, normal breast tissues are also 
highly environmentally and hormonally sensitive, and the 
specific state of a normal breast tissue in a particular 
patient is often dependent upon numerous demographic 
factors, such as age, menopausal status, and medication 
history. Thus, it is formally possible that a certain 
amount of the variation in expression state observed in 
tumours may also be reflected in non-malignant breast 
tissue. Since the inventors' data set consists of 
both normal and malignant samples, they were able to 
compare the inherent variability of normal and tumour 
samples to each other. To perform this comparison, they 
utilized principal component analysis (PCA) on the entire 
8,000 gene expression matrix, comprising a total of 22 
non-malignant and 26 tumour specimens. Using PCA, the 
inventors reduced the total gene set to a series of distinct 
'components', in which each component represents a finite 
amount of gene expression variation across the primary data 
set. They hypothesized that observed variation in the data 
could arise from multiple sources, such as intrinsic 
biological variation, as well as experimentally introduced 
variation (such as differences in sample harvesting, 
hybridization and labelling conditions, etc.). However, 
since the normal and tumour samples were identically 
harvested, treated and processed in their experiments, 
variations due to experimental conditions and handling 
should be equally shared between both groups. Thus, any 
differences in variation between the tumour and normal 
groups can most likely be attributed to intrinsic 
biological variation. 

The inventors plotted the amount of variation observed in 
the normal and tumour data sets against their principal 
components (Figure 4). In order to effectively compare the 
two datasets, each component was normalized to the first 
component in that dataset, resulting in a graph that 
depicts how the total variation across the dataset 'decays' 
with each successive principal component (by convention, 
the first principal component is usually taken to represent 
the elements that exhibit maximal variation across the 
dataset). The inventors observed that, as a general rule, 
every component corresponding to the tumour data set 
exhibited higher variation than the analogous 
component in the normal data set. These data indicate that 
the gene expression profiles of normal breast samples are 
significantly more 'static' or 'unchanging' when compared 
to tumour profiles, supporting the hypothesis that the wide 
variations in gene expression observed in tumours may be a 
consequence of breast tumours arising from multiple 
tumourigenic pathways. 
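
The normalization step described above may be illustrated by the 
following sketch. It assumes that the per-component variances 
have already been obtained from a PCA of each data set; the 
variance values shown are invented purely for illustration and 
are not the inventors' results. 

```python
def normalised_decay(component_variances):
    """Express each component's variance relative to the first component."""
    first = component_variances[0]
    return [v / first for v in component_variances]


# Hypothetical per-component variances, ordered from the first component on.
normal_variances = [40.0, 8.0, 4.0, 2.0]
tumour_variances = [50.0, 30.0, 20.0, 10.0]

normal_curve = normalised_decay(normal_variances)
tumour_curve = normalised_decay(tumour_variances)

# True when every tumour component is at least as high as its normal analogue.
tumour_higher = all(t >= n for t, n in zip(tumour_curve, normal_curve))
```

Because both curves start at 1, any separation between them 
reflects how quickly the variation falls away in each data set 
rather than the absolute scale of the measurements. 
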

Conservation of Molecular Subtypes of Breast Cancer Across 
Distinct Ethnic Populations 

The inventors then used Affymetrix Genechips to profile 56 
invasive breast cancers and 6 normal breast tissues that 
had been isolated from Chinese patients. The raw expression 
profile scans were subjected to one round of quality 
control, data filtering and processing (see Materials and 
Methods), and an unsupervised hierarchical clustering 
algorithm was used to order the normalized profiles relative 
to one another on the basis of their transcriptional 
similarity. Using a dataset of 1796 genes, comprising genes 
that were both well-measured across at least 70% of all 
samples and exhibited considerable transcriptional variation 
across the samples (as reflected by a high standard 
deviation), the inventors observed that the majority of the 
samples segregated into several discernible groups that 
could be correlated to specific histopathological 
parameters. For example, many of the ER+ tumors clustered 
together ((S1) bar, Figure 5a), as did the ERBB2+/ER- 
samples ((S3) bar). The normal breast samples also 
clustered as a discernible group whose individual members 
exhibited very high correlation to one another, suggesting 
that there is less transcriptional variation in normal 
breast tissues than in tumors. A number of samples, 
however, were not accurately segregated by the unsupervised 
clustering algorithm (gray bar). It is possible that such 
'mixed clustering' results may be attributable to 'noise' 
contributed by non-malignant components in the primary 
tissue sample, such as normal breast epithelial tissue, 
lymphocytic infiltrates, and reactive desmoplastic tissue. 
As previously mentioned, a similar observation was obtained 
using the cDNA microarray platform, suggesting that this 
phenomenon is technology-platform independent. 
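
For illustration, the gene filter described above (well-measured 
in at least 70% of samples, with a high standard deviation) may 
be sketched as follows. The standard-deviation cut-off, the use 
of None for a poorly measured value and the toy data are 
assumptions of this sketch; the source does not state the 
cut-off that produced the 1796-gene dataset. 

```python
import statistics


def filter_genes(expression, min_present_fraction=0.7, min_sd=1.0):
    """Keep genes that are well-measured and variable across samples.

    expression: dict mapping gene name -> list of values, where None
    marks a value that was not well-measured (assumed convention).
    """
    kept = []
    for gene, values in expression.items():
        present = [v for v in values if v is not None]
        if len(present) / len(values) < min_present_fraction:
            continue  # not measured in enough samples
        if len(present) >= 2 and statistics.stdev(present) >= min_sd:
            kept.append(gene)
    return kept


# Toy data (hypothetical): 10 samples per gene.
expression = {
    "G1": [1.0, 5.0, 2.0, 6.0, 1.5, 5.5, 2.5, 6.5, 1.0, 5.0],   # variable
    "G2": [3.0, 3.1, 2.9, 3.0, 3.1, 2.9, 3.0, 3.1, 2.9, 3.0],   # nearly flat
    "G3": [1.0, None, None, None, 6.0, None, 2.0, 7.0, None, 1.0],  # 50% present
}
kept = filter_genes(expression)
```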

One objective of this study was to determine if the 
molecular subtypes and associated expression signatures 
defined in previously published studies were also detectable 
in a separate patient population. The inventors focused on 
correlating their expression results to those of Perou et 
al. (2000), a landmark study in which a similar analysis had 
been performed on a series of breast cancer specimens 
derived from US and Norwegian patients. Briefly, in that 
study and a subsequent companion report (Sorlie et al., 
2001), the authors determined that invasive breast cancers 
could be subdivided into at least 5 distinct molecular 
subtypes based upon an 'intrinsic' geneset representing 
genes whose transcriptional variation is primarily due to 
the malignant tumor component. The specific expression 
signatures that represent the 'hallmark' elements of each 
particular subtype are summarized in Table 1 (this dataset 
is hereafter referred to as the Stanford study). Between 
the Stanford study and the inventors' work, there are 
several differences in methodology and experimental design, 
such as differences in sample handling protocols, patient 
population, and expression array platform (2-color cDNA 
microarray in the Stanford study vs 1-color Genechips in 
the inventors' study, as well as different array probe 
sequences). The availability of two distinct breast cancer 
expression datasets from independent institutions (Stanford 
and the inventors) thus allowed the inventors to test 
whether, despite these differences, the molecular 
subtypes defined in one institution's experiments are 
indeed sufficiently robust to be detectable in another 
institution's study. 

To perform this analysis, the inventors first identified 
probes on the Affymetrix U133A Genechip corresponding to 
genes belonging to the 'intrinsic' set as defined by the 
Stanford study (see Materials and Methods). Of 403 unique 
genes found in the Stanford 'intrinsic' set, 292 genes, or 
72.5% of the intrinsic set, were also found on the Genechip 
array. The inventors henceforth refer to this overlapping 
set of genes as the 'common intrinsic set' (CIS). 
Importantly, the CIS still contains many of the 'hallmark' 
genes whose transcription was reported in the Stanford 
study to be useful for discriminating between subtypes, and 
reclustering of the Stanford tumors using the CIS also 
yielded groupings highly similar to those obtained using the 
full intrinsic set (data not shown). When the invasive 
cancers in the inventors' series were reclustered on the 
basis of the CIS, they observed a striking improvement in 
the segregation pattern, in which all the cancer samples 
now grouped into highly distinct classes. The inventors then 
proceeded to compare the molecular subtypes defined in 
their study to those discovered by the Stanford study 
(Luminal A, Luminal B/C, Basal, Normal-like, and ERBB2+) 
(Perou et al., 2000; Sorlie et al., 2001). 
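
The construction of the CIS amounts to intersecting two gene 
lists, and may be sketched as follows. The short gene lists are 
hypothetical stand-ins chosen for illustration; only the counts 
403 and 292 are taken from the text above. 

```python
def common_intrinsic_set(intrinsic_genes, array_genes):
    """Return the genes present in both lists and the percentage of the
    intrinsic set that they represent."""
    intrinsic = set(intrinsic_genes)
    common = intrinsic & set(array_genes)
    coverage = 100.0 * len(common) / len(intrinsic)
    return common, coverage


# Hypothetical miniature gene lists, for illustration only.
intrinsic = ["ESR1", "GATA3", "KRT5", "KRT17", "ERBB2"]
on_array = ["ESR1", "GATA3", "KRT5", "ERBB2", "ACTB"]
common, coverage = common_intrinsic_set(intrinsic, on_array)

# The proportion reported in the text: 292 of 403 intrinsic genes.
reported_coverage = round(100.0 * 292 / 403, 1)
```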

Luminal subtypes: All of the cancers in this group were ER+ 
by conventional immunohistochemistry. The Stanford study 
defined at least two groups of luminal tumors, Luminal A 
and Luminal B/C, the latter being associated with a poorer 
clinical prognosis (Luminal B and C tumors are treated as a 
single class, as it is reportedly difficult to divide them 
into two discrete groups (Sorlie et al., 2001)). Consistent 
with the Stanford study, the inventors also observed the 
presence of a robust Luminal molecular subtype that was 
highly similar to the Luminal A subtype of the Stanford 
study, as this subtype was characterized by high levels of 
expression of ER and related genes such as GATA3, HNF3a, 
and X-box Binding Protein 1 (bar (S1)). They could not, 
however, clearly determine if the Luminal B/C subtypes as 
defined by the Stanford study were also present in their 
patient population, based upon the criteria that both the 
B/C subtypes are associated with intermediate levels of 
ER-related gene expression, and that the luminal C subtype 
also expresses high levels of a 'novel' gene cluster. The 
inventors also observed the presence of a second luminal 
subclass (ER+/ERBB2+) which was distinct from the luminal 
A cancers in that this other subclass expressed 
intermediate levels of ER-related genes (similar to Luminal 
B/C) and genes found in the 'novel' cluster (similar to 
luminal C; bar (S2)). This subclass, however, also expressed 
high levels of ERBB2-related genes, and is thus likely to 
be distinct from the luminal C cancers defined by the 
Stanford study, as luminal C cancers express low levels of 
the ERBB2 gene cluster. Taken collectively, the inventors' 
results indicate that Luminal A tumors ('Luminal' in Fig. 5) 
constitute a robust molecular subtype that can be commonly 
found across different patient populations. Conversely, the 
luminal B/C and ER+/ERBB2+ subtypes may represent less 
robust variants whose presence may be more significantly 
affected by differences in ethnic specificity, sample 
handling protocols, or array technology. 

As seen in Figure 5, tumours belonging to the Luminal 
category (subtype S1) appear to be transcriptionally 
homogeneous on the basis of the CIS. To determine if 
tumours belonging to this subtype could be further 
subdivided, the inventors reclustered a larger group of 
Luminal tumours using a separate set of genes which in a 
previous report had been shown to be indicative of a 
tissue's cellular proliferative status (Sorlie et al., 
2001). 

On the basis of these "proliferation genes", they found 
that the Luminal tumours could be subdivided into two 
distinct types, namely, "pure" luminal A and another 
subtype that they have referred to as a Luminal D subtype 
(Figure 9a). It is likely that the Luminal A/D subdivision 
is clinically meaningful, as a reclustering of a more 
diverse set of tumours on the basis of the "proliferation 
genes" resulted in two broad subdivisions, one representing 
clinically aggressive tumours (Basal, ERBB2 and Luminal D), 
and the other representing tumours that are more clinically 
tractable (Luminal, Normal/Normal-like) (Figure 9b). 

Basal-like: The basal molecular subtype was reported in 
the Stanford study to be characterized by high levels of 
two expression signatures: I) markers of the basal mammary 
epithelia, such as keratin 5 and 17, and II) genes 
belonging to the 'novel' cluster. Consistent with the 
Stanford study, the inventors also observed a basal subtype 
associated with similar expression signatures (bar (S4)), 
indicating that the basal molecular subtype is also highly 
robust. In addition, however, they also detected the 
apparent presence of another subtype (bar (S5)) that was 
not associated with any of the expression signatures 
described in the Stanford study. 

Normal Breast-like: The 'normal-like' subtype is 
associated with expression of a gene cluster that is also 
highly expressed in normal breast tissues, and includes 
genes such as four and a half LIM domains 1, aquaporin 1, 
and alcohol dehydrogenase 2 (class I) beta. A number of 
tumors in the inventors' series also clustered with the 
normal breast tissues and exhibited this expression 
signature (bar (S6)). Thus, the 'normal-like' molecular 
subtype can also be considered to be a robust subtype. 

ERBB2+: The Stanford study also defined a final ERBB2+ 
subtype, in which tumors were characterized by high 
levels of expression of ERBB2-related genes (column E), 
intermediate levels of expression of the 'novel' cluster 
(column B), and absent expression of ER-related genes 
(column A). A similar ERBB2+ subtype was also clearly 
present in the inventors' series (bar (S3)). Consistent 
with the expression data, they also subsequently confirmed 
that the tumors belonging to this molecular subtype were 
all ERBB2+ by conventional immunohistochemistry. 

To summarize, of the 5 molecular subtypes defined by the 
Stanford study, the inventors clearly detected at least 
4 subtypes in their own patient population (luminal A, 
basal-like, normal breast-like, and ERBB2+). They could 
not clearly determine if one particular subtype (luminal 
B/C) was present in their series using the genes in the 
CIS, and they also detected the potential presence of 2 
additional subtypes (ER+ ERBB2+ and ER- Subtype II) which 
have not been reported before. The finding that the 
majority (4/5) of the Stanford molecular subtypes were also 
clearly detectable in the inventors' study suggests that, 
despite many methodological differences between centres, 
molecular subtypes as defined by expression-based 
genomics are indeed remarkably robust and conserved between 
different patient populations. 

Ductal Carcinoma in situ (DCIS) Cancers Express The 
Hallmark Expression Signatures of Invasive Cancer Molecular 
Subtypes 

The previous results indicate that molecularly similar 
subtypes of breast cancer can indeed occur and be detected 
across distinct ethnic populations. One limitation of these 
studies, however, is that it is often very difficult to 
profile the same cancer over an extended period of time. As 
such, one question that is often raised is whether these 
molecular variants represent subtypes that are truly 
distinct biological entities, or whether they simply 
reflect a single or a few subtypes in different stages 
of evolution. Since these two different theories, referred 
to as the 'distinct origins' and the 'evolutionary' 
hypotheses respectively (Figure 6e), have different 
implications for clinical diagnosis and subsequent staging 
and monitoring, it is thus important to determine which of 
these proposed mechanisms is the case for breast cancer. 
Unfortunately, it is not possible to distinguish between 
these two models by only studying invasive cancers that 
have been sampled at a single point in time, as both 
hypotheses would be expected to produce results similar to 
that shown in Figure 5. 

In conventional histopathology, ductal carcinoma-in-situ 
(or DCIS) has long been recognised as the major precursor 
to invasive breast cancer, and likely represents the 
earliest morphologically detectable malignant non-invasive 
breast lesion. Despite their malignant status, however, 
DCIS cancers are also distinct from invasive cancers in a 
number of respects. Clinically, DCIS cancers are treated 
differently from invasive cancers (DCIS cases are primarily 
treated with surgery with or without adjuvant radiotherapy) 
(Harris et al., 1997), and DCIS and invasive cancers also 
differ substantially in their distribution of specific 
cancer types (Barnes et al., 1992; Tan et al., 2002). 
Differences such as these raise the possibility that while 
DCIS cases are malignant, they may also be molecularly 
distinct in some respects from more advanced invasive 
cancers. The inventors reasoned that the 'distinct 
origins' and 'evolutionary' hypotheses could be tested by 
profiling a series of DCIS cancers and comparing their 
profiles to their invasive counterparts. Each hypothesis 
carries different predictions. If the 'distinct origins' 
hypothesis is true, then the DCIS cancers, representing 
'early' cancers, should express many, if not all, of the 
hallmark expression signatures associated with their more 
mature invasive counterparts. Alternatively, if the 
'evolutionary' hypothesis is correct, then one might expect 
the DCIS profiles to be more closely similar to one 
another than to their invasive counterparts. The inventors 
obtained 12 DCIS tissue samples whose histopathological 
status was confirmed by a pathologist using both 
conventional H & E staining and frozen cryosections 
of the actual sample that was processed (Figure 2a and b). 



WO 03/070979 



PCT/GB03/00755 



Expression profiles of the DCIS samples were then generated 
and compared to their invasive counterparts. Using the CIS 
as a starting dataset, the inventors found that the DCIS 
samples segregated amongst the various invasive cancer 
samples into distinct categories. Specifically, 5 DCIS 
samples segregated into the Luminal subtype, 4 into the 
ER-/ERBB2+ subtype, 2 into the ER+/ERBB2+ subtype, and 1 
into the 'normal breast-like' subtype. Importantly, within 
each subtype, each of the DCIS cancers was found to 
robustly express the hallmark expression signatures of its 
particular molecular group. Interestingly, no DCIS samples 
were found to cluster within the basal or ER- subtype II 
molecular subtypes, which is consistent with previously 
proposed theories that these subtypes may develop without a 
(or possess an extremely transient) DCIS component (Barnes 
et al., 1992). These results suggest that distinct breast 
cancer molecular subtypes are present even at the DCIS 
stage of breast cancer tumorigenesis, supporting the 
hypothesis that the subtypes represent truly distinct 
biological entities, possibly arising via different 
tumorigenic pathways (the 'distinct origins' hypothesis). 

Genes Associated with the Normal/DCIS/Invasive Cancer 
Transitions Implicate Disregulation of Wnt Signaling as a 
Common Early Event in Breast Tumorigenesis and that Luminal 
A and ERBB2+ Cancers Exhibit Similar Invasion Programs 

Mammary tumorigenesis can be broadly divided into two main 
steps. First, normal breast epithelial tissue is 
transformed to a malignant state via the concerted 
deregulation of various cellular pathways (Hahn and 
Weinberg, 2002). Second, to progress to an invasive cancer, 
several additional biological subprograms also have to be 
executed, including penetration of the surrounding 
basement membrane, invasion of the cancer into the adjacent 
normal stroma, and angiogenic recruitment of endothelial 
vessels for tumor nourishment and maintenance (Hanahan and 
Weinberg, 2000). Given the molecular heterogeneity of 
breast cancer, one important question in the field is the 
extent to which the genetic programs that control these two 
key steps are subtype specific or commonly shared among all 
breast cancer subtypes. 

To identify genes whose expression level was significantly 
different between normal breast tissues, DCIS cancers, and 
their invasive counterparts, the inventors used 
significance analysis of microarrays (SAM), a robust 
statistical methodology that has been used in previous 
reports to identify significantly regulated genes (Tusher 
et al., 2001). They concentrated on studying the luminal 
and ERBB2+ cancers, as most of the DCIS samples in their 
study belonged to these two molecular subtypes. First, they 
tested and confirmed the hypothesis that DCIS cancers, 
despite expressing many of the hallmarks of invasive 
cancers, are nevertheless still transcriptionally distinct 
from invasive cancers. The inventors compared 5 luminal 
DCIS cancers to 5 luminal invasive cancers, and determined 
that 222 genes were significantly regulated using a 2-fold 
cut-off criterion and a false-discovery rate (FDR) of 5%. 
In contrast, a control analysis comparing only invasive 
luminal A cancers which had been randomly distributed into 
2 groups failed to identify any significantly regulated 
genes under these stringent conditions. A similar result 
was also obtained for DCIS and invasive cancers belonging 
to the ERBB2+ subtype (data not shown), indicating that 
significant transcriptional differences exist between DCIS 
and invasive cancers belonging to both the Luminal A and 
ERBB2+ subtypes. 

SAM was then used to identify genes that were significantly 
regulated during either the normal/DCIS or the DCIS/invasive 
transition for both the luminal A and ERBB2 molecular 
subtypes (FDR = 5%). The results are summarized in Figure 
8a. In total, for the luminal A subtype, a greater 
number of genes were significantly down-regulated during 
the normal/DCIS transition than up-regulated (705 genes down 
vs 245 genes up), while for the DCIS/invasive transition 
more genes were significantly increased in expression than 
decreased (56 genes down vs 277 genes up). Similarly, for 
the ERBB2 subtype, 367 genes were significantly 
down-regulated and 275 genes up-regulated during the 
normal/DCIS transition, while 113 genes were down-regulated 
and 294 genes up-regulated during the transition from DCIS 
to invasive cancer. 
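
The counts reported above can be tabulated with the short sketch 
below, which derives the total number of regulated genes and the 
percentage that were down-regulated for each transition. Only the 
counts are taken from the text; the variable names and the 
choice of summary are assumptions made for illustration. 

```python
# Gene counts as reported in the text (FDR = 5%).
reported = {
    ("luminal A", "normal/DCIS"): {"down": 705, "up": 245},
    ("luminal A", "DCIS/invasive"): {"down": 56, "up": 277},
    ("ERBB2", "normal/DCIS"): {"down": 367, "up": 275},
    ("ERBB2", "DCIS/invasive"): {"down": 113, "up": 294},
}

summary = {}
for key, counts in reported.items():
    total = counts["down"] + counts["up"]
    percent_down = round(100.0 * counts["down"] / total, 1)
    summary[key] = (total, percent_down)
```

In both subtypes the down-regulated share is larger at the 
normal/DCIS transition than at the DCIS/invasive transition. 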

The following provides an outline as to how the genesets of 
Tables 4, 5, 6 and 7 were determined. 

A "Genetic Identifier" that can Distinguish between a 
Normal vs Tumour Breast Sample 

Methodology: 

Data set: 95 Breast Tissue Samples (11 Normal and 84 
Tumors) 

Step 1: The data for each sample was normalized by median 
centering each expression profile around 5000 fluorescence 
units (the Genechip technology measures expression 
abundance of each gene in terms of fluorescence units, from 
0 to 65535). 

Step 2: An intensity filter was applied such that only 
genes with intensity values in the range of 200 to 100,000 
were retained. 

Step 3: A 'Valid value' filter was applied such that genes 
that were at least 70% present (i.e. above a minimum 
threshold value, usually about 200) in either normals or 
tumors or both were retained. 

Step 4: A statistical t-test was performed to select genes 
that were differentially expressed in normal vs tumors at a 
significance level of p < 0.00001. This resulted in the 
selection of 507 genes. 

Step 5: Of the 507 genes, a high fold change filter was 
applied to select genes that exhibited large differences in 
expression between normal and tumor samples (2.5-fold and 
above). This resulted in the identification of 49 genes (up 
in tumors) and 81 genes (up in normals) respectively. These 
genes are listed in Table 4a. 

Step 6: The 130 (49 and 81) genes were ranked using support 
vector machine gene ranking in order of their importance in 
being able to assign an unknown breast sample to either a 
tumor or normal group. This was done to arrive at a small 
subset of genes that can accurately distinguish normal from 
tumor samples. The top 32 genes gave close to 1% 
misclassification. The results are given in Table 4b. 

Step 7: The 32-gene set was tested for its predictive 
accuracy in the classification of normal vs tumor samples, 
using leave-one-out cross-validation (LVO CV) testing. No 
misclassifications were observed. 
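
Steps 1, 3 and 5 above may be sketched as follows, for 
illustration only. The sketch assumes that median centering is 
done by multiplicative scaling and that fold change is the ratio 
of group means, neither of which is stated explicitly for this 
data set; the intensity filter of Step 2 and the t-test of Step 4 
are omitted. All function names and toy values are hypothetical. 

```python
import statistics


def median_center(profile, target=5000.0):
    """Step 1 (assumed multiplicative): scale one sample's profile so
    that its median equals `target` fluorescence units."""
    scale = target / statistics.median(profile)
    return [v * scale for v in profile]


def fraction_present(values, threshold=200.0):
    """Fraction of samples in which the gene is at or above the threshold."""
    return sum(1 for v in values if v >= threshold) / len(values)


def select_genes(normals, tumours, min_present=0.7, min_fold=2.5):
    """Steps 3 and 5: valid-value filter, then fold-change filter.

    normals / tumours: dict mapping gene -> normalized intensities.
    Returns (genes up in tumours, genes up in normals).
    """
    up_in_tumour, up_in_normal = [], []
    for gene in normals:
        n, t = normals[gene], tumours[gene]
        # Step 3: keep genes present in >= 70% of normals or of tumours.
        if max(fraction_present(n), fraction_present(t)) < min_present:
            continue
        # Step 5: fold change taken as the ratio of group means (assumption).
        fold = statistics.mean(t) / statistics.mean(n)
        if fold >= min_fold:
            up_in_tumour.append(gene)
        elif fold <= 1.0 / min_fold:
            up_in_normal.append(gene)
    return up_in_tumour, up_in_normal


centered = median_center([1000.0, 2000.0, 4000.0])

# Toy intensities (hypothetical), three samples per group.
normals = {"gene_A": [300, 320, 310], "gene_B": [3000, 3100, 2900],
           "gene_C": [50, 60, 40], "gene_D": [1000, 1100, 900]}
tumours = {"gene_A": [900, 1000, 950], "gene_B": [1000, 900, 1100],
           "gene_C": [70, 250, 90], "gene_D": [1050, 950, 1000]}
up_in_tumour, up_in_normal = select_genes(normals, tumours)
```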



Support Vector Machine (SVM) Gene Ranking 



This approach is used to rank the genes in a dataset 
according to their importance in being able to assign an 
unknown sample to a particular group. Typically, the 
samples in the dataset are divided into a training set (75%) 
and a test set (25%). A maximum margin hyperplane separating 
the two classes (e.g. ER+ vs ER-) is calculated for the 
training set. 

Assuming 'm' genes are present in the set, the equation of 
the maximum margin hyperplane is 

H = W_1 * G_1 + W_2 * G_2 + ... + W_i * G_i + ... + W_m * G_m 

where the W_i are the weights and the G_i refer to the 
variables (genes). 

Using the genes corresponding to various top 'N' weights 
(the weight is an indicator of the importance of a gene in 
classification), the class of all samples in the test set is 
predicted. The prediction rules are built for varying sets 
of top N genes. The above procedure is repeated 100 times 
and the gene ranks and misclassification rates are 
averaged. 
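
The ranking procedure may be illustrated by the sketch below. It 
is not the inventors' implementation: in place of a true maximum 
margin solver it substitutes a simple hinge-loss subgradient 
descent without a bias term, and the learning rate, number of 
epochs, function names and toy data are all assumptions. 

```python
import random


def train_linear(samples, labels, epochs=50, lr=0.1, lam=0.01):
    """Hinge-loss subgradient descent; returns one weight W_i per gene.
    Stand-in for an SVM solver (assumption). Labels are +1 or -1."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [wi * (1.0 - lr * lam) for wi in w]   # regularisation shrink
            if margin < 1.0:                           # sample inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w


def rank_genes(samples, labels, top_n, repeats=100, seed=0):
    """Repeat random 75%/25% splits, rank genes by |weight|, and average
    the ranks and the test-set misclassification rate of the top N genes."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    rank_sum = [0.0] * n_genes
    error_sum = 0.0
    indices = list(range(len(samples)))
    for _ in range(repeats):
        rng.shuffle(indices)
        cut = int(0.75 * len(indices))
        train, test = indices[:cut], indices[cut:]
        w = train_linear([samples[i] for i in train], [labels[i] for i in train])
        order = sorted(range(n_genes), key=lambda g: -abs(w[g]))
        for rank, g in enumerate(order):
            rank_sum[g] += rank
        top = order[:top_n]
        errors = 0
        for i in test:
            score = sum(w[g] * samples[i][g] for g in top)
            predicted = 1 if score >= 0 else -1
            errors += predicted != labels[i]
        error_sum += errors / len(test)
    mean_rank = [r / repeats for r in rank_sum]
    ranked = sorted(range(n_genes), key=lambda g: mean_rank[g])
    return ranked, error_sum / repeats


# Toy data (hypothetical): gene 0 tracks the class, genes 1 and 2 do not.
samples = [[1.0, 0.1, 0.0], [1.0, -0.1, 0.0], [1.0, 0.1, 0.0], [1.0, -0.1, 0.0],
           [-1.0, 0.1, 0.0], [-1.0, -0.1, 0.0], [-1.0, 0.1, 0.0], [-1.0, -0.1, 0.0]]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
ranked, mean_error = rank_genes(samples, labels, top_n=1)
```

Calling rank_genes with several values of top_n gives the 
averaged misclassification rate for each candidate geneset size. 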

"Genetic Identifiers" that can Predict the Estrogen 
Receptor Status and the ERBB2 Receptor Status of a Breast 
Tumour Sample 

Methodology: 

Data set: 55 invasive breast tumor samples. The individual 
tumors were assigned to the following groups on the basis 
of immunohistochemistry (IHC): 

a) Estrogen receptor (ER) status: 35 ER positive and 20 
ER negative samples 

b) c-erbB-2 (ERBB2) status: 21 ERBB2 positive and 34 
ERBB2 negative samples. 

Step 1: Gene selection to identify genes that are 
differentially expressed between a) ER+ vs ER- tumors, and 
b) ERBB2+ vs ERBB2- samples. Three independent gene 
selection techniques were used: 

• Significance Analysis of Microarrays (SAM), a 
statistical technique that uses random permutations of 
the expression data to estimate the 'false discovery 
rate', i.e. the chance that a particular gene will 
be falsely called as being differentially expressed 
(Tusher et al., 2001). The genes are then ranked by 
their "relative difference", which is similar to the 
ranking used in Step 6, above. The top 100 significant 
genes were selected. 

• A signal to noise (S2N) strategy was used to rank 
genes based on their correlation with the class 
distinction (either ER+/ER- or ERBB2+/ERBB2-) (Golub 
et al., 1999). The top 100 genes were selected. 

• A support vector machine (SVM) ranking strategy was 
used to rank the genes according to their importance 
in assigning a breast tumor sample to the correct 
class (see below). The optimal gene set (with highest 
accuracy) was selected. 
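
As an illustration of the second technique, the signal to noise 
statistic of Golub et al. (1999) is the difference of the class 
means divided by the sum of the class standard deviations. The 
sketch below assumes the sample standard deviation and ranks 
genes by the absolute value of the statistic, which is a 
simplification; the function names and toy data are hypothetical. 

```python
import statistics


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b) for one gene."""
    return (statistics.mean(class_a) - statistics.mean(class_b)) / (
        statistics.stdev(class_a) + statistics.stdev(class_b))


def top_genes(expression, labels, k):
    """Return the k genes with the largest absolute S2N score.

    expression: dict mapping gene -> values per sample.
    labels: True for the first class (e.g. ER+), False for the second.
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, is_a in zip(values, labels) if is_a]
        b = [v for v, is_a in zip(values, labels) if not is_a]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:k]


# Toy data (hypothetical): three samples per class.
labels = [True, True, True, False, False, False]
expression = {
    "gene_1": [3.0, 5.0, 7.0, 1.0, 2.0, 3.0],    # moderately higher in class A
    "gene_2": [2.0, 3.0, 4.0, 2.5, 3.5, 3.0],    # no difference in means
    "gene_3": [1.0, 1.5, 2.0, 8.0, 9.0, 10.0],   # strongly lower in class A
}
selected = top_genes(expression, labels, k=2)
```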

Step 2: Common Gene Set (CGS): The genes from the 3 
independent analyses were pooled, and the genes 
selected by all three methods were retained. Hence these 
genes are method-independent and sufficiently robust to be 
used as a 'genetic identifier' to predict either the ER or 
ERBB2 status of a breast tumor sample. 
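
Selecting the genes common to all three methods is a set 
intersection, which may be sketched as follows. The three short 
gene lists are hypothetical and serve only to show the 
operation; they are not the lists obtained by the inventors. 

```python
def common_gene_set(*gene_lists):
    """Return, in sorted order, the genes selected by every method."""
    common = set(gene_lists[0])
    for genes in gene_lists[1:]:
        common &= set(genes)
    return sorted(common)


# Hypothetical selections from the three methods.
sam_genes = ["ESR1", "GATA3", "TFF1", "XBP1"]
s2n_genes = ["ESR1", "GATA3", "XBP1", "KRT5"]
svm_genes = ["GATA3", "ESR1", "FOXA1"]

cgs = common_gene_set(sam_genes, s2n_genes, svm_genes)
```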



Result: 

• For ER classification, the CGS contains 25 unique 
genes (18 up, 7 down regulated) 

• For ERBB2 classification, the CGS contains 26 unique 
genes (19 up, 7 down regulated) 

The genes belonging to each CGS are listed in Table 5. 

Finally, the accuracy of each CGS for tumor classification 
was assessed using LVO CV testing. The classification 
algorithm used was a Support Vector Machine (SVM). Average 
cross validation error rate = 7.286% for ER classification 
(overall accuracy 92%), and 6.26% for ERBB2 classification 
(overall accuracy 93%). 
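
Leave-one-out cross-validation may be sketched as follows. For 
brevity the sketch substitutes a nearest-centroid classifier for 
the Support Vector Machine used by the inventors; that 
substitution, the function names and the toy data are 
assumptions made for illustration only. 

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Assign x to the class whose mean profile is closest (stand-in
    classifier; the inventors used an SVM)."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def leave_one_out_error(samples, labels, predict=nearest_centroid_predict):
    """Hold out each sample in turn, train on the rest, and return the
    percentage of held-out samples that were misclassified."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) != labels[i]:
            errors += 1
    return 100.0 * errors / len(samples)


# Toy data (hypothetical): two genes, three samples per class.
samples = [[5.0, 1.0], [6.0, 1.2], [5.5, 0.8],
           [1.0, 4.0], [1.2, 5.0], [0.8, 4.5]]
labels = ["ER+", "ER+", "ER+", "ER-", "ER-", "ER-"]
error_rate = leave_one_out_error(samples, labels)
```
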

"Genetic Identifiers" that can Predict the Molecular 
Subtype of a Breast Tumour Sample 

Methodology: 

Data set: Expression profiles for tumors belonging to the 
various subtypes were generated using Affymetrix U133A 
Genechips. The hallmark expression signatures that 
characterize each subtype are described above. 



a) Luminal (19) 

b) ERBB2 (19) 

c) Basal (7) 

d) ER negative type 2 (5) 

e) Normal and Normal like (12) 



A. Identification of a Minimal Geneset for Classification 
Using a One-vs-All Support Vector Machine Approach 

Step 1: The data for each sample was normalized by median 
centering each expression profile around 1000 flouresence 
units (the Genechip technology measures expression 
abundance of each gene in terms of flouresence units, from 
0 to 65535) 
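The normalization of Step 1 can be sketched as follows. The text does not state whether the centering is multiplicative or additive; this sketch assumes a multiplicative rescaling so that the median of each sample's profile becomes 1000. The function name and the `target` parameter are hypothetical.

```python
from statistics import median


def median_center(profile, target=1000.0):
    """Scale one sample's expression values so their median equals target.

    Assumes multiplicative scaling, which is an interpretation of
    'median centering around 1000 units', not a stated detail.
    """
    m = median(profile)
    if m <= 0:
        raise ValueError("median must be positive to rescale")
    factor = target / m
    return [v * factor for v in profile]
```

Applying the function to each sample independently puts all profiles on a comparable scale before any filtering or gene selection.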

Step 2: A 'Valid value' filter was applied such that genes
that were at least 70% present (i.e. above a minimum
threshold value, usually about 200) across all samples were
chosen.
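The 'Valid value' filter can be sketched in a few lines. The sketch assumes that "present" means strictly above the threshold and that the 70% requirement is evaluated over all samples; the function name and parameter names are hypothetical.

```python
def valid_value_filter(expression, threshold=200.0, min_fraction=0.7):
    """Keep genes whose value exceeds threshold in at least
    min_fraction of the samples.

    expression: dict mapping gene name -> list of values, one per sample
    """
    kept = {}
    for gene, values in expression.items():
        if not values:
            continue  # no measurements, nothing to evaluate
        present = sum(1 for v in values if v > threshold)
        if present / len(values) >= min_fraction:
            kept[gene] = values
    return kept
```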
Step 3: Five different data sets were created by leaving
one of the above-mentioned groups out and combining the
four remaining groups (i.e. 'One-vs-all').



Dataset  Description
1        Luminal (19) vs Rest (43)
2        ERBB2 (19) vs Rest (43)
3        Basal (7) vs Rest (55)
4        ER negative type 2 (5) vs Rest (57)
5        Normal and Normal like (12) vs Rest (50)
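The 'Rest' group sizes for the five one-vs-all datasets follow directly from the subtype counts given under Methodology (62 samples in total). A minimal sketch of that bookkeeping is shown here; the function name and dictionary layout are illustrative assumptions.

```python
# Sample counts per subtype, as listed under Methodology.
subtype_counts = {
    "Luminal": 19,
    "ERBB2": 19,
    "Basal": 7,
    "ER negative type 2": 5,
    "Normal and Normal like": 12,
}


def one_vs_all_sizes(counts):
    """For each subtype, return (group size, size of all other groups)."""
    total = sum(counts.values())
    return {name: (n, total - n) for name, n in counts.items()}
```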



Step 4: For each of the 5 datasets, genes were selected
that exhibited a minimum 2 fold change between groups
(Ratio of means was used to calculate the fold change
between two groups).
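The fold-change selection of Step 4 can be sketched as below. The sketch assumes that a "minimum 2 fold change" covers both directions (ratio of means at least 2, or at most 0.5) and that the mean of the 'Rest' group is positive after normalization. The function name and parameters are hypothetical.

```python
from statistics import mean


def select_by_fold_change(expression, in_group, min_fold=2.0):
    """Select genes whose ratio of group means is at least min_fold
    in either direction.

    expression: dict mapping gene name -> list of values, one per sample
    in_group: list of booleans, True for samples in the 'one' group
    Returns a dict mapping each selected gene to its ratio of means.
    """
    selected = {}
    for gene, values in expression.items():
        group = [v for v, g in zip(values, in_group) if g]
        rest = [v for v, g in zip(values, in_group) if not g]
        ratio = mean(group) / mean(rest)  # assumes mean(rest) > 0
        if ratio >= min_fold or ratio <= 1.0 / min_fold:
            selected[gene] = ratio
    return selected
```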



The results are as follows:

Dataset  Description                                Differentially regulated (2 fold)
1        Luminal (19) vs Rest (43)                  116
2        ERBB2 (19) vs Rest (43)                    46
3        Basal (7) vs Rest (55)                     318
4        ER negative type 2 (5) vs Rest (57)        309
5        Normal and Normal like (12) vs Rest (50)   188
Step 5: A support vector machine gene ranking analysis was
performed for each of the five datasets to rank genes in
the order of their importance in assigning an unknown
breast sample to its appropriate class (e.g. ER or ERBB2
status, see above).

For datasets 1, 3, 4, and 5, a geneset was selected that
yielded a 3% misclassification rate. In the case of
dataset 2 (ERBB2 vs rest), the use of all 46 genes gave a
minimum error rate of 9.7%. Hence, all 46 were used in the
predictor set. The predictor sets are shown in Table 6.



Dataset  Description                                Differentially      Top 'N'  Error
                                                    regulated (2 fold)  genes    rate (%)
1        Luminal (19) vs Rest (43)                  116                 35       3
2        ERBB2 (19) vs Rest (43)                    46                  46       9.7
3        Basal (7) vs Rest (55)                     318                 20       3
4        ER negative type 2 (5) vs Rest (57)        294                 111      3
5        Normal and Normal like (12) vs Rest (50)   188                 50       3



Step 6: The samples were all combined into one dataset and
one-vs-all cross-validation analysis was carried out using
the various predictor sets. 100 independent iterations of
75:25 (training:test) random splits were used, resulting in
an overall cross-validation error rate of 5.25% (overall
accuracy 94%).
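The repeated random-split validation of Step 6 can be sketched as follows. Note that this sketch substitutes a simple nearest-centroid classifier for the support vector machine used in the analysis, purely to keep the example self-contained; the function names, the seed parameter, and the pooled way the error rate is computed are all assumptions.

```python
import random
from statistics import mean


def nearest_centroid_fit(samples, labels):
    """Compute one centroid (per-gene mean) for each class label."""
    centroids = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = [mean(column) for column in zip(*members)]
    return centroids


def nearest_centroid_predict(centroids, sample):
    """Return the label whose centroid is closest to the sample."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(sample, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def random_split_error(samples, labels, iterations=100,
                       train_fraction=0.75, seed=0):
    """Pooled test error over repeated random training:test splits."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_train = int(round(train_fraction * len(samples)))
    errors = 0
    tested = 0
    for _ in range(iterations):
        rng.shuffle(indices)
        train, test = indices[:n_train], indices[n_train:]
        centroids = nearest_centroid_fit(
            [samples[i] for i in train], [labels[i] for i in train])
        for i in test:
            tested += 1
            if nearest_centroid_predict(centroids, samples[i]) != labels[i]:
                errors += 1
    return errors / tested
```

Each sample is a list of expression values restricted to the predictor genes. With very small classes, a random split can leave a class out of the training portion entirely, in which case its test samples are necessarily counted as errors.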




WO 03/070979 PCT/GB03/00755 



B. Identification of a Minimal Geneset for Classification 
Using a Genetic Algorithm/Maximum Likelihood Discriminant 
(GA/MLHD) Approach 

The GA/MLHD approach is a different classification 
algorithm (Ooi & Tan, 2003) that serves as an alternative 
to the OVA SVM described in A. 

Step 1: Samples were broken down into the following
classes:

Class                    No. of samples
ER- subtype II           5
ERBB2+                   19
Normal and Normal-like   12
Luminal                  19
Basal                    7



A truncated dataset of 1000 genes was then established by 
selecting genes that exhibited the largest standard 
deviation (SD) across all the samples. 
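Selecting the most variable genes can be sketched as follows. The sketch assumes the sample standard deviation is used and that each gene has at least two measurements; the function name is hypothetical.

```python
from statistics import stdev


def top_variable_genes(expression, n=1000):
    """Return the n gene names with the largest standard deviation.

    expression: dict mapping gene name -> list of values, one per sample
    """
    ranked = sorted(expression, key=lambda g: stdev(expression[g]),
                    reverse=True)
    return ranked[:n]
```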

Step 2: 24 runs of the GA/MLHD algorithm were performed on
the 62 breast cancer samples based on the class distinction
described in Table 4. The accuracy of the predictor sets
selected by the GA/MLHD algorithm was assessed by
cross-validation and independent test studies.

Details of GA/MLHD properties:

(a) Crossover rates: 0.7, 0.8, 0.9, 1.0

(b) Mutation rates: 0.0005, 0.001, 0.002, 0.0025,
0.005, 0.01

(c) Uniform crossover

(d) Selection: stochastic uniform sampling

(e) Predictor set size range: Rmin = 1 and Rmax = 80

30 optimal predictor sets with sizes ranging from 13 to 17
genes per predictor set were obtained. Each predictor set
was associated with a classification accuracy of 1 error
out of 62 samples (error rate: 1.61%, overall
classification accuracy 98%). 10 out of the 30 predictor
sets wrongly classified the Luminal-A sample 980221T as a
Normal sample. Of the other 20 predictor sets, 19
misclassified the ERBB2+ sample 990262T as an ER- subtype II
sample, while 1 predictor set wrongly classified the same
990262T sample as a Basal-type sample. Two of the optimal
predictor sets are displayed in Table 6b.

Identification of a Luminal D Subclass in the Asian Breast 
Cancer Population 

Previous breast cancer expression profiling studies done on
primarily Caucasian populations revealed the existence of a
'luminal' subtype characterized by the high expression of
estrogen-receptor related genes such as ESR1, GATA3, and
LIV-1. These 'luminal' cancers could be further
subdivided into at least 2 subtypes: Luminal A and
Luminal B/C. While Luminal A tumors express very high
levels of ER related genes, Luminal B/C cancers express
intermediate levels of the ER gene cluster. Furthermore,
luminal C tumors also express high levels of a 'novel' gene
cluster. Luminal B/C tumors were found to exhibit a worse
clinical prognosis than Luminal A tumors, arguing that
these subtypes are indeed clinically relevant.

A similar study on breast cancers derived from Chinese
patients performed in Singapore confirmed that the luminal
A subtype is also present in the Asian patient population.
However, the luminal B/C subtype was not detected. This
difference may be due to methodological differences between
the two studies or to true differences in patient
population.

A careful inspection by the inventors of the original
Caucasian study subsequently revealed that Luminal C tumors
are also associated with high levels of a gene cluster whose
members are involved in cellular proliferation. In
contrast, this 'proliferation cluster' is lowly expressed
in Luminal A tumors. The high expression of genes in the
'proliferation cluster' may functionally contribute to the
worse clinical prognosis associated with Luminal C tumors,
as high expression of this cluster is also seen in tumors
belonging to the clinically aggressive ERBB2+ and basal
(ER-) subtypes. Thus, although a luminal B/C subtype was not
observed in the Asian breast cancer population, the
inventors hypothesized that the genes in this
'proliferation' cluster could also be used to subdivide the
previously homogeneous Luminal A tumors found in the Asian
population into distinct luminal subtypes.

Results

Identification of 'proliferation cluster' linked-genes on
the Affymetrix U133A Genechip

In the inventors' study, the expression profiles of several
breast tumors were obtained using commercially available
Affymetrix U133A Genechips. Genes corresponding to the
original 'proliferation' cluster members were then selected
from the Genechip. Of the 65 genes comprising the original
'proliferation cluster', the inventors determined that 36
(55%) were also present on the Genechip array.

Discovery of a 'Luminal D' Subtype in the Asian Luminal 
Tumor Population 

The inventors then used this 36-geneset to recluster a
group of tumors which in their previous analysis had been
homogeneously assigned to the Luminal A subtype. As seen in
Figure 1, the 36-geneset strikingly divided the tumors into
two broad groups characterized by low and high levels of
expression of the 36-geneset respectively. The former group
is henceforth referred to as the true 'luminal A'
subtype, while the latter group is referred to as 'luminal
D', as its expression profile is distinct from previously
identified subtypes.

High levels of expression of the 36-geneset are also
observed in other aggressive tumor subtypes

To determine if Luminal D tumors are also more clinically
aggressive than Luminal A tumors, the inventors then
determined if high expression levels of this cluster were
also observed in aggressive tumor subtypes by reclustering
a larger series of their tumors using only the 36-gene
'proliferation cluster'. As seen in Figure 2, Luminal D
tumors intermixed with tumors of the ERBB2+ and Basal
subtypes, while Luminal A tumors mixed with the normal and
'normal-like' tumors. This result suggests that the Luminal
D tumors may share certain hallmarks of more highly
aggressive tumors, and that the Luminal D subtype may be
clinically relevant.

A 'Genetic Identifier' for the Luminal D Subtype 

The inventors then proceeded to develop a 'genetic 
identifier' for the Luminal D subtype. In this strategy, 
the 'genetic identifier' should only be applied to a tumor 
that has previously been characterized as Luminal in 
nature, for example by the other 'genetic identifiers' 
shown in Tables 5 and 6. 

Step 1: A series of expression profiles for 19 tumors which
had been previously characterized as Luminal A were
normalized by median centering each expression profile
around 1000 fluorescence units.

Step 2: A 'Valid value' filter was applied such that genes
that were at least 70% present (i.e. above a minimum
threshold value, usually about 200) across all samples were
chosen.

Step 3: To divide the samples in a more robust fashion, a
Principal Component Analysis (PCA) was then used to
ascertain the Luminal A and D subgroups using the 36-gene
proliferation geneset (Figure 3).

Step 4: Using the Luminal A (12 samples) vs. Luminal D (7
samples) groupings, genes were selected from the entire
expression profile that exhibited a minimum 2 fold change
between the two groups (Ratio of means was used to
calculate the fold change between two groups). 111 such
genes were identified in this analysis.

Step 5: An SVM gene ranking analysis was then performed for
the 111-gene dataset to rank genes in the order of their
importance in assigning a luminal breast cancer sample into
either the Luminal A or Luminal D subtypes. The top 45
genes gave the lowest error rate (about 12%). 18 genes were
up regulated in Luminal D and 27 were down regulated in
Luminal D. The genes are depicted in Table 7.

Step 6: The accuracy of the 45-gene genetic identifier was
then assessed using leave-one-out cross-validation. No
misclassifications were observed.

Discussion 

One outstanding challenge of the post-genomic era is to
translate the huge amounts of raw sequence data generated
by various genome sequencing projects into applications
that improve healthcare and the treatment of disease. One
area which could be revolutionised by the availability of
these new resources is the field of molecular
diagnostics, where the pathologic classification of a
tissue, as a complement to conventional histopathology,
is also based upon a set of informative molecular markers.
Importantly, one advantage of the molecular approach is
that classification schemes based upon molecular data can
be sufficiently sensitive to detect clinically relevant
disease subtypes that have so far eluded traditional light
microscopy approaches (Ash et al., 2000, Bittner et al.,
2000).

However, before the potential of molecular diagnostics can
be fully realized, a number of challenges must be met and
overcome. Firstly, for many common diseases, key
informative genes that are able to discriminate between the
relevant disease sub-classes in question must be
identified. Secondly, in order to be feasibly utilized as
part of a clinical assay, these genes must be 'pared' down
to a minimal set ('genetic identifiers') that collectively
still delivers high predictive accuracy. Thirdly, because
the clinical behaviour of many diseases can vary
extensively amongst different ethnic groups and
populations, it will be necessary to define appropriate
limits of use of these 'genetic identifiers' for specific
patient populations.

To address these issues, the inventors have embarked upon a
large-scale expression profiling project of breast tissues
derived from Asian patients. Previous reports have
focused on samples derived from patients of
primarily Caucasian origin (Perou et al., 2000, Gruvberger
et al., 2000, Hedenfalk et al., 2000), and it is essential
to determine if findings obtained from these studies will
be applicable to other ethnic populations. This is
especially so given the epidemiological and clinical
differences in breast cancer between these distinct ethnic
groups. In Caucasian populations, the majority of breast
cancers tend to occur in post-menopausal women. However,
in Singapore and Japan, the absolute number of breast
cancer cases per year is roughly 1/3 that of the US and the
incidence of breast cancer in these populations is
bi-modal: the first peak, representing the majority of
breast cancers, occurs in pre-menopausal women at around
the age of 40 (Chia et al., 2000). This first peak is then
followed by a second peak at about age 55-60. The earlier
incidence of breast cancer in Asian populations is unlikely
to be due to earlier detection, as breast cancer screening
programs in these countries are still relatively novel
compared to Western countries. To explain these
observations, one possibility may be that the breast
cancers observed in these groups may represent distinct
heterogeneous subtypes arising from specific genetic or
environmental differences. For example, it is known that
the levels of estrogen and progesterone in Chinese women
tend to be substantially lower than in Caucasians (Lippman,
1998).

To ensure maximal diversity in the repertoire of expression
profiles used in the inventors' analysis, the inventors
selected samples derived from patients from a wide variety
of demographic and clinical backgrounds, as well as tumours
of varying grades and appearances. First, the inventors
identified a 'genetic identifier' in breast cancer for what
is perhaps the most basic distinction of clinical utility,
i.e. distinguishing if a given sample is 'normal' or
'malignant'. Although this distinction can be currently
made by a qualified pathologist using conventional
histopathology, the availability of such a molecular assay
would still be of use in clinical settings where rapid
diagnosis is required, or when a pathologist may not be
readily available. By focusing on highly reproducible
'outlier' genes in both normal and tumour datasets, the
inventors identified a minimal set of 20 genes that is
apparently able to accurately predict if an unknown breast
sample is normal or malignant in both a training set and
naive test set of comparable sample quantity. In addition,
using principal component analysis, they were able to show
that the expression profiles of normal breast samples
appear to be far less varied than their corresponding
tumour profiles. In the field of breast cancer research,
there are surprisingly few reports in the
literature that have directly addressed the question of
distinguishing between normal and tumour tissues using the
relatively unbiased manner afforded by the DNA microarray
approach. In one major study, it was found that the
expression profiles of normal breast tissues were
sufficiently similar for them to co-segregate with each
other using an unsupervised clustering methodology (Perou
et al., 2000). However, in that report, the investigators
also found that the normal samples, rather than segregating
as an independent branch distinct from the tumour samples,
instead segregated within a broad tumour class originating
from mammary epithelial cells of 'basal' or 'myoepithelial'
origin. This result, most likely due to the similarity of
genes that are expressed in normal tissues and tumours of
this subclass, illustrates that it may not be trivial to
use purely unsupervised methodologies to discriminate
between normal and tumour breast tissues. However, while
this appears to be an issue for breast cancer genomics, it
may not apply to other tissue types. For example, it
appears that unsupervised clustering is able to
discriminate between normal and malignant colon samples
(Alon et al., 1999). One reason for this may be that colon
tumours, which primarily arise from disruption of the
APC/β-catenin pathway, may be genetically more uniform than
breast tumours.

The genes involved in the 20-gene 'genetic identifier'
belong to many different categories. Genes such as
apolipoprotein D are well-known terminal differentiation
genes in breast biology, while MAGED2 was previously
isolated as a gene that is overexpressed in primary breast
tumours, but not in normal mammary tissue or breast cancer
cell lines (Kurt et al., 2000). Another gene, ITA3, which
encodes the alpha-3 subunit of the alpha-3/beta-1
integrin, has been shown to be associated with mammary
tumour metastasis (Morini et al., 2000). CAV1, whose
protein product links integrin signaling to the Ras/ERK
pathway, has also previously been identified as a potential
tumour suppressor gene (Wary et al., 1998, Weichen et al.,
2001), which may explain its expression in normal breast
tissues but not tumours. In addition to genes with known
roles in breast and tumour biology, other intriguing genes
were identified whose role in tumourigenesis is unclear or
not known. For example, thrombin, best known for its role in
the coagulation cascade, has recently been shown to inhibit
tumour cell growth, which may explain its expression in
normal but not tumour breast samples (Huang et al., 2000).
Another example is the human homolog of the S. cerevisiae
PWP2 gene, which in yeast plays an essential role in cell
growth and separation (Shafaatian et al., 1996).

To gain insights into the diversity of breast cancer
molecular subtypes in the Asian population, the inventors
then generated and analyzed a series of expression profiles
of both invasive breast cancers and DCIS cancers. The aim
of this work was to attempt to validate the molecular
subtyping scheme defined in the Stanford study using
another breast cancer expression dataset. By comparing
their expression profiles to previously published studies
performed using patient samples of primarily Caucasian
origin, they found that the majority of molecular subtypes
and hallmark expression signatures were robustly conserved
between the two series. Although a similar validation
study has recently been reported for prostate cancer
(Rhodes et al., 2002), this report is the first time such a
comparative analysis has been performed for breast cancer.
The conservation of molecular subtypes between the two
populations is all the more remarkable when one considers
the many methodological differences existing between the
studies. For example, one finding of interest was the
inventors' ability to detect similar subtypes in both
series despite the differences in array technology
platform. This result is significant as there is currently
conflicting data in the field regarding the feasibility of
integrating data from different genomic expression
technologies. For example, in Rhodes et al. (2002), it was
reported that prostate cancer expression data from spotted
cDNA arrays yielded similar data to oligonucleotide arrays.
In contrast, another recent report comparing the expression
profiles of cell lines as measured by spotted and
oligonucleotide arrays reported a very poor correlation
between the studies (Kuo et al., 2002). The inventors'
results suggest that data from different technology
platforms can indeed be compared, so long as the subtype
distinctions in question are fairly robust in nature. The
inventors' results also suggest that, despite the
epidemiological differences in breast cancer between the
Asian and Caucasian populations (see beginning of
Discussion), breast cancers in the two ethnic groups
are to a first approximation highly molecularly similar.

The inventors also found that DCIS cancers robustly express
many subtype-specific gene expression signatures,
suggesting that these molecular subtypes can be discerned
even at this pre-invasive stage. Thus, it is unlikely that
these subtypes represent an evolving cancer class; rather,
they are distinct biological entities that may possess
different tumorigenic origins. Despite the expression of
subtype-specific expression signatures in DCIS cancers (as
reported in this study), there is other evidence in the
field that DCIS cancers may be distinct from invasive
cancers. For example, previous retrospective reports have
shown that the majority of low nuclear grade DCIS tumors
undergo a long clinical evolution to invasive cancer (Page
et al., 1982; Betsill et al., 1978; and Rosen et al., 1980),
suggesting that additional genetic events must occur before
they become invasive. In addition, histopathological
studies have found that there is a considerable difference
in the histopathological distribution of tumor types in
DCIS cancers vs invasive cancers, with ERBB2+ cancers being
much more highly represented in DCIS compared to invasive
cases (Barnes et al., 1992). It has been unclear, however,
if this observation should be interpreted to mean that
the ER-ERBB2- cancers lack a DCIS component, or if the
ERBB2+ cancers will eventually evolve to an ERBB2- state.
The distinctive segregation of the DCIS cancers in the
inventors' series suggests that the former is true, since
the ERBB2+ cancers already express many ERBB2+ invasive
hallmarks.

Finally, by integrating the expression profiles of normal,
DCIS, and invasive cancers belonging to the luminal A and
ERBB2+ subtypes, the inventors were able to define sets of
genes which were regulated in a common and subtype-specific
manner during the normal, DCIS, and invasive cancer
transitions. Although the results of these analyses
clearly need to be supported by further experimental work
before any definitive conclusions can be made, there were a
number of intriguing observations. Firstly, the inventors
found that a number of components of the Wnt signaling
pathway were commonly regulated during the transition from
normal to DCIS for both subtypes, implicating deregulation
of Wnt signaling as an important common event in breast
carcinogenesis. Although previous reports have described the
involvement of the Wnt pathway in human breast
carcinogenesis (Smalley et al., 2001), it has been less
clear if this is an early or late event. The inventors'
results suggest the former possibility is more likely.
Secondly, the remarkable commonality of genes regulated
from the DCIS to the invasive stage between the two
subtypes suggests that many of the genetic processes that
underlie cellular invasion, desmoplastic reaction, stromal
remodeling etc., may be fairly general and shared across
different breast cancer subtypes. Finally, the inventors'
results also suggest that the two cancer subtypes may be
highly metabolically distinctive, with ERBB2+ tumors having
a greater reliance on ionic-related processes, while
Luminal A tumors may be under a state of chronic metabolic
stress. These results are potentially important: for
example, the increased metabolic load of Luminal A tumors
may explain why ER+ tumors are more radiosensitive than ER-
tumors (Villalobos et al., 1996), and calcium signaling may
play a role in tumor cell motility controlled by the ERBB2
receptor (Feldner and Brandt, 2002).
Table 3: Tabulation of expression signatures associated with
breast tumor subtypes. Subclasses include Luminal A (L-A),
Luminal B (L-B), Luminal C (L-C), Basal (Bas), Normal like
(Nor), ERBB2 (ERB). Levels of expression are indicated by H
(high expression), I (intermediate expression), and A (absent
expression).

                                                             Tumor subtype
Expression Signature                             Unigene     L-A  L-B  L-C  Bas  Nor  ERB

Luminal Epithelium                                           H    I    I    A    A    A
  estrogen receptor 1                            Hs.1657
  GATA binding protein 3                         Hs.169946
  LIV-1                                          Hs.79136
  Xbox binding protein 1                         Hs.149923
  Hepatocyte Nuclear Factor 3 alpha              Hs.299867

Basal Epithelium                                             A    A    A    H    H    A
  Keratin 5                                      Hs.195850
  Keratin 17                                     Hs.2785
  Laminin gamma 2                                Hs.54451
  Fatty acid binding protein 7                   Hs.26770

erbb2 related genes                                          A    A    A    A    A    H
  C-ERB-B2                                       Hs.323910
  GRB7                                           Hs.86859
  TIAF1                                          Hs.75822
  TRAF4                                          Hs.8375

Normal breast like                                           A    A    A    A    H    A
  CD36 antigen collagen type 1 receptor          Hs.75613
  Four and a half LIM domain 1                   Hs.239069
  vascular adhesion protein 1                    Hs.198241
  alcohol dehydrogenase 2 class 1                Hs.4

Novel                                                        A    A    H    H    A    I
  kinesin-like 5 mitotic kinesin-like protein 1  Hs.270845
  putative integral membrane transporter         Hs.296398
  gamma-glutamyl hydrolase conjugase             Hs.78619
  squalene epoxidase                             Hs.71465
Claims

1. A method of creating an expression profile
characteristic of a breast tumour cell, said method
comprising the steps of

(a) isolating expression products from said breast
tumour cell and a normal breast cell;

(b) contacting said expression products for both the
tumour and normal breast cell with a plurality of binding
members capable of specifically binding to expression
products of one or more of the genes selected from Table
2; so as to create an expression profile of those genes
for both the tumour cell and the normal cell;

(c) comparing the expression profile of the tumour
cell and the normal cell; and

(d) determining an expression profile characteristic
of a breast tumour cell.

2. A method of creating an expression profile
characteristic of a breast tumour cell, said method
comprising the steps of

(a) isolating expression products from a breast
tumour cell, contacting said expression products with a
plurality of binding members capable of specifically and
independently binding to expression products of a
plurality of genes selected from Table 2; so as to create
a first expression profile of a tumour cell;

(b) isolating expression products from a normal
breast cell; contacting said expression products with the
plurality of binding members as used in step (a), so as
to create a comparable second expression profile of a
normal breast cell; and

(c) comparing the first and second expression
profiles to determine an expression profile
characteristic of a breast tumour cell.

3. A method of creating a nucleic acid expression
profile characteristic of a breast tumour cell, said
method comprising the steps of

(a) isolating expression products from a first
breast tumour cell, contacting said expression products
with a plurality of binding members capable of
specifically and independently binding to expression
products of a plurality of genes selected from Table 2,
so as to create a first expression profile;

(b) repeating step (a) with expression products
from at least a second breast tumour cell so as to create
at least a second expression profile;

(c) comparing the at least first and second
expression profiles to create a standard nucleic acid
expression profile characteristic of a breast tumour
cell.



4. A method according to any one of the preceding
claims wherein the binding members are capable of
specifically and independently binding to five or more
genes selected from Table 2.

5. A method according to any one of the preceding
claims wherein the binding members are capable of
specifically and independently binding to each of the
genes provided in Table 2.

6. A method according to any one of the preceding
claims wherein the expression product is mRNA or cDNA.

7. A method according to any one of the preceding
claims wherein the binding members are nucleic acid
probes.

8. A method according to any one of claims 1 to 5
wherein the expression product is a polypeptide.

9. A method according to claim 8 wherein the binding
members are antibody binding domains.

10. A method according to any one of the preceding
claims wherein the binding members are labelled.

11. A method according to any one of claims 1 to 9
wherein the expression products are labelled.

12. A method for determining the presence or risk of
breast cancer in an individual, said method comprising

(a) obtaining expression products from a breast
tissue cell obtained from an individual suspected of
having or at risk from having breast cancer;

(b) contacting said expression products with binding
members capable of specifically and independently binding
to expression products corresponding to a plurality of
the genes identified in Table 2; and

(c) determining the presence or risk of breast
cancer in said individual based on the binding of the
expression products from said breast tissue cell to one
or more of the binding members.

13. A method according to claim 12 wherein the binding
members are capable of binding to expression products
corresponding to at least five of the genes identified in
Table 2.

14. A method according to claim 12 or claim 13 wherein
the binding members are capable of binding to expression
products corresponding to each of the genes identified in
Table 2.

15. A method according to any one of claims 12 to 14
wherein the determination of the presence or risk of
breast cancer in said individual is carried out by
comparing the binding of the expression products from the
breast tissue cell under test with an expression profile
characteristic of a breast tumour cell.

16. A method according to claim 15 wherein said
expression profile characteristic of a breast tumour cell
is created by a method according to any one of claims 1
to 11.

17. A method according to any one of claims 12 to 16
wherein the individual is of Asian descent.

18. A method of creating a nucleic acid expression 
profile characteristic of a breast tumour cell, said 
method comprising the steps of 

30 (a) isolating expression products from said breast 

tumour cell and a normal breast cell; 
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(b) contacting said expression products for both the 
tumour and normal breast cell with a plurality of binding 
members capable of specifically binding to expression 
products of a plurality of genes selected from Table 4a; 

5 so as to create an expression profile of those genes for 

both the tumour cell and the normal cell; 

(c) comparing the expression profile of the tumour 
cell and the normal cell; and 

(d) determining a nucleic acid expression profile 
10 characteristic of breast tumour cell. 

19. A method of creating a nucleic acid expression 
profile characteristic of a breast tumour cell, said 
method comprising the steps of 

(a) isolating expression products from a breast
tumour cell; contacting said expression products with a
plurality of binding members capable of specifically and
independently binding to expression products of a
plurality of genes selected from Table 4a; so as to
create a first expression profile of a tumour cell;

(b) isolating expression products from a normal
breast cell; contacting said expression products with the
plurality of binding members as used in step (a); so as
to create a comparable second expression profile of a
normal breast cell;

(c) comparing the first and second expression 
profiles to determine an expression profile 
characteristic of a breast tumour cell. 

20. A method according to claim 18 or claim 19 wherein
the said plurality of genes are selected from Table 4b.

21. A method according to claim 19 wherein at least five
genes are selected from Table 4a. 

22. A method according to claim 19 wherein at least 
twenty genes are selected from Table 4a.

23. A method according to claim 19 wherein the plurality 
of genes comprise at least those provided in Table 4b. 

24. A method according to any one of claims 18 to 23
wherein the expression product is mRNA or cDNA.

25. A method according to any one of claims 18 to 23 
wherein the binding members are nucleic acid probes. 

26. A method according to any one of claims 18 to 23 
wherein the expression product is a polypeptide. 

27. A method according to claim 26 wherein the binding
members are antibody binding domains.

28. A method according to any one of claims 18 to 27 
wherein the binding members are labelled. 

29. A method according to any one of claims 18 to 27
wherein the expression products are labelled.

30. A method for determining the presence or risk of 
breast cancer in an individual, said method comprising 
(a) obtaining expression products from a breast
tissue cell obtained from an individual suspected of
having or at risk from having breast cancer;

(b) contacting said expression products with binding
members capable of binding to expression products 
corresponding to a plurality of genes identified in Table 
4a; and 

(c) determining the presence or risk of breast 
cancer in said individual based on the binding of the 
expression products from said breast tissue cell to one 
or more of the binding members. 

31. A method according to claim 30 wherein at least five 
genes are selected from Table 4a. 

32. A method according to claim 30 wherein at least 
twenty genes are selected from Table 4a. 

33. A method according to claim 23 wherein the plurality 
of genes are at least those identified in Table 4b. 

34. A method according to any one of claims 30 to 33 or 
claim 24 wherein the determination of the presence or 
risk of breast cancer in said individual is carried out 
by comparing the binding of the expression products from 
the breast tissue cell under test with an expression 
profile characteristic of a breast tumour cell.

35. A method according to claim 34 wherein said 
expression profile characteristic of a breast tumour cell 
is created by a method according to any one of claims 18 
to 29. 

36. A method according to any one of claims 30 to 35 
wherein the determination of the presence or risk of 
breast cancer is computed using an algorithm which
distinguishes a tumour cell from a normal cell by their
respective expression profiles. 

37. A method of obtaining a plurality of gene expression
profiles in order to determine a standard expression 
profile characteristic of presence and/or type of breast 
cancer, said method comprising 

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression
products;

c) contacting said gene expression products with a 
plurality of binding members specific for expression 
products of one or more genes selected from Table 2; and 

d) determining a gene expression profile 
characteristic of the presence and/or type of breast 
cancer based on the binding of said expression products 
to said binding members for each of said plurality of 
breast tumour samples. 

38. A method of obtaining a plurality of gene expression 
profiles in order to determine a standard expression 
profile characteristic of presence and/or type of breast 
cancer, said method comprising 

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression 
products; 

c) contacting said gene expression products with a 
plurality of binding members specific for expression 
products of one or more genes selected from Table 4a; and 

d) determining a gene expression profile
characteristic of the presence and/or type of breast
cancer based on the binding of said expression products
to said binding members for each of said plurality of
breast tumour samples.

39. A method of obtaining a plurality of gene expression 
profiles in order to determine a standard expression 
profile characteristic of presence and/or type of breast 
cancer, said method comprising 

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression 
products; 

c) contacting said gene expression products with a 
plurality of binding members specific for expression 
products of one or more genes selected from Table 4b; and 

d) determining a gene expression profile 
characteristic of the presence and/or type of breast 
cancer based on the binding of said expression products 
to said binding members for each of said plurality of 
breast tumour samples.

40. A method of obtaining a plurality of gene expression
profiles in order to determine a standard expression
profile characteristic of presence and/or type of breast
cancer, said method comprising 

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression
products;

c) contacting said gene expression products with a
plurality of binding members specific for expression 
products of one or more genes selected from Table 5; and 

d) determining a gene expression profile 
characteristic of the presence and/or type of breast 
cancer based on the binding of said expression products 
to said binding members for each of said plurality of 
breast tumour samples. 

41. A method of obtaining a plurality of gene expression 
profiles in order to determine a standard expression 
profile characteristic of presence and/or type of breast 
cancer, said method comprising 

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression 
products; 

c) contacting said gene expression products with a 
plurality of binding members specific for expression 
products of one or more genes selected from Table 6a; and 

d) determining a gene expression profile 
characteristic of the presence and/or type of breast 
cancer based on the binding of said expression products 
to said binding members for each of said plurality of 
breast tumour samples.

42. A method of obtaining a plurality of gene expression 
profiles in order to determine a standard expression 
profile characteristic of presence and/or type of breast 
cancer, said method comprising 

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression
products;

c) contacting said gene expression products with a
plurality of binding members specific for expression
products of one or more genes selected from Table 7; and

d) determining a gene expression profile 
characteristic of the presence and/or type of breast 
cancer based on the binding of said expression products 
to said binding members for each of said plurality of
breast tumour samples.

43. A method of obtaining a plurality of gene expression 
profiles in order to determine a standard expression 
profile characteristic of presence and/or type of breast 
cancer, said method comprising

a) obtaining cells from a plurality of breast tumour
samples;

b) disrupting said cells to expose gene expression
products;

c) contacting said gene expression products with a
plurality of binding members capable of specifically and
independently binding to expression products of the genes
identified in Table 6b;

d) determining a gene expression profile
characteristic of the presence and/or type of breast
cancer based on the binding of said expression products
to said binding members for each of said plurality of
breast tumour samples.

44. A method according to any one of claims 37 to 43
further comprising the step of producing a database
containing a plurality of expression profiles obtained
from said plurality of breast tumour samples. 

45. A method according to any one of claims 37 to 43 
further comprising the step of determining the 
statistical variation between the plurality of expression 
profiles.

46. A database comprising expression profiles 
characteristic of breast cancer or type of breast cancer 
produced by a method according to claim 37 or claim 45. 

47. A database according to claim 46 wherein the
expression profiles are nucleic acid expression profiles. 

48. A database according to claim 46 wherein the 
expression profiles are protein expression profiles. 

49. A method for classifying a breast tumour cell on the 
basis of Estrogen receptor (ER) status, said method 
comprising 

(a) obtaining expression products from a breast 
tumour cell; 

(b) contacting said expression products with binding 
members capable of binding to expression products 
corresponding to the genes identified in Table 5a; and 

(c) classifying the breast tumour on the basis of ER 
status based on the binding of the expression products 
from said breast tumour cell to one or more of the 
binding members. 

50. A method for classifying a breast tumour cell on the
basis of ERBB2 status, said method comprising 

(a) obtaining expression products from a breast 
tumour cell; 

(b) contacting said expression products with binding 
members capable of binding to expression products 
corresponding to the genes identified in Table 5b; and 

(c) classifying the breast tumour on the basis of 
ERBB2 status based on the binding of the expression 
products from said breast tumour cell to one or more of 
the binding members. 

51. A method for classifying a breast tumour cell on the 
basis of its molecular subtype, said method comprising 

(a) obtaining expression products from a breast 
tumour cell; 

(b) contacting said expression products with binding 
members capable of binding to expression products 
corresponding to a plurality of genes identified in Table 
6a; and 

(c) classifying the tumour cell with regard to its 
molecular subtype based on the binding profile of the 
expression products from the tumour cell and the binding 
members.

52. A method according to claim 51 wherein the binding 
members are capable of specifically and independently 
binding to at least 5 genes identified in Table 6a. 

53. A method according to claim 51 wherein the binding 
members are capable of specifically and independently 
binding to at least twenty genes identified in Table 6a. 

54. A method according to claim 51 wherein the binding
members are capable of specifically and independently 
binding to at least the genes identified in Table 6b. 

55. A method according to any one of claims 51 to 54 
wherein the molecular subtypes are selected from Luminal, 
ERBB2, Basal, ER-type II and normal/normal-like. 

56. A method for classifying a breast tumour cell on the 
basis of its Luminal sub-class, said method comprising 

(a) obtaining expression products from a breast 
tumour cell; 

(b) contacting said expression products with 
binding members capable of binding to expression products 
corresponding to a plurality of genes identified in Table 
7; and 

(c) classifying the tumour cell with regard to its 
Luminal sub-class based on the binding profile of the 
expression products from the tumour cell and the binding 
members.

57. A method according to claim 56 wherein said tumour 
cell has been previously classified as a Luminal 
molecular subtype by a method according to any one of 
claims 51 to 55. 

58. A method according to claim 56 or claim 57 wherein
the Luminal sub-class is Luminal D or Luminal A. 

59. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of a plurality of genes selected
from Table 4a, said plurality of binding members being 
fixed to a solid support. 

60. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of a plurality of genes selected 
from Table 4b, said plurality of binding members being 
fixed to a solid support. 

61. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of a plurality of genes selected 
from Table 5a, said plurality of binding members being 
fixed to a solid support. 

62. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of a plurality of genes selected 
from Table 5b, said plurality of binding members being 
fixed to a solid support. 

63. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of a plurality of genes selected 
from Table 6a, said plurality of binding members being 
fixed to a solid support. 

64. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of a plurality of genes selected 
from Table 7, said plurality of binding members being
fixed to a solid support. 

65. A diagnostic tool comprising a plurality of binding 
members capable of specifically and independently binding 
to expression products of the genes identified in Table 
6b, said plurality of binding members being fixed to a 
solid support. 

66. A diagnostic tool according to any one of claims 59 
to 65 wherein said binding members are cDNA or 
oligonucleotides. 

characteristic of breast tumour, methods for determining the
presence or risk of breast cancer in an individual, using 
expression product(s) corresponding to the F2 gene 



2. claims: 1-45 (all partially; see remark below) 
Inventions 2-573 



methods of creating/obtaining expression profile 
characteristic of breast tumour, and/or methods for 
determining the presence or risk of breast cancer in an 
individual, and/or methods for classifying breast tumour 
cells using expression product(s) corresponding to at least 
a breast cancer related gene, and/or diagnostic tools 
comprising said expression product(s), 



wherein said gene is: 



-for invention 2: NCKAP1 gene 
-for invention 3: PWP2H gene 

-for inventions 4-573: CRYAB gene-gene corresponding to
GenBank no. NM_016640 (as listed in tables 2, 4a, 5a, 5b, 6
and 7) 



